{{Project
|Has project output=Tool
|Has sponsor=McNair Center
|Has title=Deep Text Classifier
|Has owner=Yang Zhang,
|Has start date=September 2017
|Has keywords=Tool
|Has project status=Active
|Does subsume=Industry Classifier,
}}
=Deep Text Classifier=
 
E:\McNair\Projects\Deep Text Classifier
==Problem Description==
==Data Preprocessing==
For data preprocessing, we adopt the same standard as the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB dataset].

'''To general users:''' Your input file (usually a single ".txt" file containing many examples, each as a row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate ".txt" file. To run the script, you basically need to specify the following:

# "File Name" : without the ".txt" extension
# "Expected Columns" : the total number of columns in the input file
# "Content Index" : the column index of the content
# "Label Index" : the column index of the label

The script will generate a pickle file with a ".pkl" extension, named the same as your input. Please rename it to indicate the label information, as discussed above, and place this pickle file in the same directory as your classification code, i.e. "classification_MMM_LLL.py".
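As a rough illustration of what these four parameters control, consider the minimal sketch below. The actual script (under E:\McNair\Projects\Deep Text Classifier) also writes examples into per-label folders in the IMDB style; this sketch shows only the column parsing, the train/test split, and the pickle step, and the tab delimiter, function name, and pickle layout here are assumptions, not the script's real interface.

 import pickle
 import random
 
 def preprocess(file_name, expected_columns, content_index, label_index,
                train_fraction=0.8, delimiter="\t"):
     """Split a raw ".txt" file into train/test (content, label) pairs."""
     examples = []
     with open(file_name + ".txt", encoding="utf-8") as f:
         for line in f:
             cols = line.rstrip("\n").split(delimiter)
             if len(cols) != expected_columns:
                 continue  # skip malformed rows
             examples.append((cols[content_index], cols[label_index]))
 
     random.shuffle(examples)
     cut = int(len(examples) * train_fraction)  # 80/20 split by default
     data = {"train": examples[:cut], "test": examples[cut:]}
 
     # Save everything in one pickle named after the input file.
     with open(file_name + ".pkl", "wb") as out:
         pickle.dump(data, out)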
'''To advanced users:'''

# One important step in data preprocessing is to encode words (usually strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, let's say "hello" is the 17th word in our dictionary; thus "hello" is encoded to 17. Our dictionary is ordered by the words' frequency: the higher the frequency, the smaller the index. That is, you should expect to see words like "the" and "a" with very small indices. Please also notice that the two indices 0 and 1 are intentionally not assigned to any words. The advantage here is that you can easily ignore very common and meaningless words, like "the", by simply saying, for example, that you only want to consider words with indices > 20. Notice that it's possible to encounter words that are not in our dictionary; we will always assign them to index 1. These words are safe to ignore given that our dictionary is big enough. (A sketch of this encoding follows the list below.)
# Saving a pickle file is a very efficient way to retrieve the data, so that you don't need to redo the data preprocessing every time you want to run your classifier.
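A minimal sketch of this encoding scheme, assuming whitespace tokenization and two reserved index slots (the function names are illustrative, not the project's actual code):

 from collections import Counter
 
 def build_dictionary(texts, num_reserved=2):
     """Map each word to an integer index, most frequent words first.
     Indices 0 and 1 are deliberately left unassigned; 1 is used later
     for out-of-dictionary words."""
     counts = Counter(word for text in texts for word in text.split())
     return {word: i + num_reserved
             for i, (word, _) in enumerate(counts.most_common())}
 
 def encode(text, dictionary, min_index=0):
     """Encode a text, sending unknown words to index 1 and optionally
     dropping very frequent words (those with index <= min_index)."""
     ids = [dictionary.get(word, 1) for word in text.split()]
     return [i for i in ids if i > min_index]

Calling encode(text, dictionary, min_index=20) then reproduces the "indices > 20" filter described above.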
==Model Training/Prediction==
We write in [https://www.tensorflow.org/ Tensorflow] for all the classifiers. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that allows you to quickly build up a neural network and train it. (If you are new to Deep Learning and Tensorflow, please do stay with Keras.) A minimal Keras sketch appears after the links in the next section.

==General Guidelines for Tuning the Hyper-Parameters==
* '''Embedding'''
[https://keras.io/layers/embeddings/ Keras Official Documentation]
[https://www.tensorflow.org/tutorials/word2vec Tensorflow : Vector Representations of Words]
[https://en.wikipedia.org/wiki/Word2vec Wiki : Word2vec]
* '''LSTM'''
[http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A Nice Blog about LSTM]
[https://www.tensorflow.org/tutorials/recurrent Tensorflow : Recurrent Neural Networks]
[https://keras.io/layers/recurrent/ Keras Official Documentation]
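To make the Embedding and LSTM pieces concrete, here is a minimal Keras sketch of this kind of classifier. The vocabulary size, sequence length, layer sizes, and the toy data are placeholder values, not the project's tuned settings; in practice the integer-encoded examples and labels come from the ".pkl" file produced in Data Preprocessing.

 import numpy as np
 from tensorflow.keras.models import Sequential
 from tensorflow.keras.layers import Embedding, LSTM, Dense
 from tensorflow.keras.preprocessing.sequence import pad_sequences
 
 vocab_size = 20000   # placeholder: size of the word-index dictionary
 max_length = 200     # placeholder: pad/truncate every example to this length
 num_labels = 2       # placeholder: number of distinct labels in your data
 
 # Placeholder data; load these from the preprocessing pickle instead.
 x_train = [[2, 7, 41], [3, 9, 5, 12]]
 y_train = np.array([0, 1])
 
 model = Sequential([
     # Turns word indices into dense vectors (see the Embedding links above).
     Embedding(input_dim=vocab_size, output_dim=128),
     # Reads the vector sequence and keeps its final hidden state.
     LSTM(64),
     # One probability per label; use 1 unit + 'sigmoid' for binary labels.
     Dense(num_labels, activation='softmax'),
 ])
 model.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
 
 model.fit(pad_sequences(x_train, maxlen=max_length), y_train,
           epochs=5, batch_size=32)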

==Summer 2018 Work==
Code, data, and attempts to run are located in: E: