<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Yangzhang</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Yangzhang"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/Yangzhang"/>
	<updated>2026-05-22T00:31:39Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20845</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20845"/>
		<updated>2017-10-18T15:59:28Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Deep Text Classifier */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
E:\McNair\Projects\Deep Text Classifier&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and then to classify based on the observed features. Our approach uses a deep neural network to extract the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two major categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, pickle files generated later will overwrite the previously generated ones. &lt;br /&gt;
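The renaming in Step 5 can also be scripted. The following is a minimal sketch, not part of the project code: the helper name rename_pickle and the label tags are hypothetical, and it assumes the pickle file sits in a directory you pass in.&lt;br /&gt;

```python
import tempfile
from pathlib import Path


def rename_pickle(directory, base_name, label_tag):
    """Rename base_name.pkl to base_name_label_tag.pkl inside directory."""
    source = Path(directory) / (base_name + ".pkl")
    target = Path(directory) / (base_name + "_" + label_tag + ".pkl")
    if target.exists():
        # Refuse to clobber an earlier result for the same label.
        raise FileExistsError(str(target))
    source.rename(target)
    return target


# Demonstration in a throwaway directory with a placeholder pickle file.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "longdescriptions.pkl").write_bytes(b"placeholder")
    renamed = rename_pickle(demo_dir, "longdescriptions", "ipo")
    renamed_name = renamed.name
    print(renamed_name)
```

Raising an error when the target already exists is a design choice in this sketch, so that an earlier pickle file for the same label is never silently replaced.&lt;br /&gt;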
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place the pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
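The 80/20 split and the IMDB-style layout described above can be sketched as follows. This is only an illustration: the function names, the shuffling with a fixed seed, and the exact path pattern are assumptions, not taken from preprocessing.py.&lt;br /&gt;

```python
import random


def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle (label, text) pairs and split them into train and test lists."""
    shuffled = list(examples)
    # Shuffling is an assumption; the real script may keep the file order.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def imdb_style_paths(split_name, examples):
    """Return hypothetical IMDB-style relative paths: split/label/index.txt."""
    return [
        "{}/{}/{}.txt".format(split_name, label, index)
        for index, (label, _text) in enumerate(examples)
    ]


# Ten toy rows with two labels, standing in for the rows of the input file.
rows = [
    ("ipo" if i % 2 == 0 else "no_ipo", "description {}".format(i))
    for i in range(10)
]
train_rows, test_rows = split_examples(rows)
train_paths = imdb_style_paths("train", train_rows)
print(len(train_rows), len(test_rows))
```

Here the label becomes the folder name and each example gets its own text file, which mirrors the description above.&lt;br /&gt;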
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any dictionary word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. Notice that it's possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to retrieve the data, so you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
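The two points above can be illustrated with a small sketch. It is not the project's preprocessing code: the function names, the whitespace tokenization, the alphabetical tie-breaking, and the choice to start real words at index 2 are all assumptions made for the example. Only the frequency ordering, the reserved indices 0 and 1, and the use of index 1 for unknown words come from the description.&lt;br /&gt;

```python
import pickle
from collections import Counter

UNKNOWN_INDEX = 1  # index used for words missing from the dictionary
FIRST_WORD_INDEX = 2  # assumption: 0 and 1 are reserved, so words start at 2


def build_dictionary(texts):
    """Map each word to an index; more frequent words get smaller indices."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: FIRST_WORD_INDEX + rank for rank, word in enumerate(ordered)}


def encode(text, dictionary):
    """Encode a text as a list of integer word indices."""
    return [dictionary.get(word, UNKNOWN_INDEX) for word in text.lower().split()]


corpus = ["the company builds the internet", "the startup sells software"]
dictionary = build_dictionary(corpus)
encoded = encode("the company sells robots", dictionary)

# Round trip through pickle, standing in for saving and reloading a .pkl file.
restored = pickle.loads(pickle.dumps((dictionary, encoded)))
print(encoded)
```

In this toy corpus the most frequent word receives the smallest index, and the unseen word at the end of the encoded sentence maps to index 1.&lt;br /&gt;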
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
We write all the classifiers in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that allows you to quickly build a neural network and train it. (If you are new to Deep Learning and Tensorflow, please do stay with Keras.)&lt;br /&gt;
&lt;br /&gt;
* '''Embedding'''&lt;br /&gt;
 [https://keras.io/layers/embeddings/ Keras Official Documentation]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/word2vec Tensorflow : Vector Representations of Words]&lt;br /&gt;
 [https://en.wikipedia.org/wiki/Word2vec Wiki : Word2vec]&lt;br /&gt;
&lt;br /&gt;
* '''LSTM'''&lt;br /&gt;
 [http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A Nice Blog about LSTM]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/recurrent Tensorflow : Recurrent Neural Networks]&lt;br /&gt;
 [https://keras.io/layers/recurrent/ Keras Official Documentation]&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20785</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20785"/>
		<updated>2017-10-12T20:46:00Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and then to classify based on the observed features. Our approach uses a deep neural network to extract the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two major categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, pickle files generated later will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place the pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any dictionary word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. Notice that it's possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to retrieve the data, so you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
We write all the classifiers in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that allows you to quickly build a neural network and train it. (If you are new to Deep Learning and Tensorflow, please do stay with Keras.)&lt;br /&gt;
&lt;br /&gt;
* '''Embedding'''&lt;br /&gt;
 [https://keras.io/layers/embeddings/ Keras Official Documentation]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/word2vec Tensorflow : Vector Representations of Words]&lt;br /&gt;
 [https://en.wikipedia.org/wiki/Word2vec Wiki : Word2vec]&lt;br /&gt;
&lt;br /&gt;
* '''LSTM'''&lt;br /&gt;
 [http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A Nice Blog about LSTM]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/recurrent Tensorflow : Recurrent Neural Networks]&lt;br /&gt;
 [https://keras.io/layers/recurrent/ Keras Official Documentation]&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20784</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20784"/>
		<updated>2017-10-12T20:45:22Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and then to classify based on the observed features. Our approach uses a deep neural network to extract the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two major categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, pickle files generated later will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place the pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any dictionary word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. Notice that it's possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to retrieve the data, so you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
We write all the classifiers in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that allows you to quickly build a neural network and train it. (If you are new to Deep Learning and Tensorflow, please do stay with Keras.)&lt;br /&gt;
&lt;br /&gt;
* '''Embedding'''&lt;br /&gt;
 [https://keras.io/layers/embeddings/ Keras Official Documentation]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/word2vec Tensorflow : Vector Representations of Words]&lt;br /&gt;
 [https://en.wikipedia.org/wiki/Word2vec Wiki : Word2vec]&lt;br /&gt;
&lt;br /&gt;
* '''LSTM'''&lt;br /&gt;
 [http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A Nice Blog about LSTM]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/recurrent Tensorflow : Recurrent Neural Networks]&lt;br /&gt;
 [https://keras.io/layers/recurrent/ Keras Official Documentation]&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20783</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20783"/>
		<updated>2017-10-12T20:44:39Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and then to classify based on the observed features. Our approach uses a deep neural network to extract the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two major categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as that of the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, the later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
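&lt;br /&gt;
The renaming can also be scripted. The following is only a minimal sketch: the helper names &amp;quot;label_pickle_name&amp;quot; and &amp;quot;rename_pickle&amp;quot; are hypothetical and are not part of preprocessing.py.&lt;br /&gt;

```python
import os


def label_pickle_name(base_name, label_tag):
    # Hypothetical helper: 'longdescriptions' + 'ipo' gives 'longdescriptions_ipo.pkl'.
    return '{}_{}.pkl'.format(base_name, label_tag)


def rename_pickle(directory, base_name, label_tag):
    # Move the default '(base_name).pkl' aside so that the next preprocessing
    # run, which writes to the default name again, cannot overwrite it.
    source = os.path.join(directory, base_name + '.pkl')
    target = os.path.join(directory, label_pickle_name(base_name, label_tag))
    # Note: os.replace silently replaces an existing file at the target path.
    os.replace(source, target)
    return target
```

Calling rename_pickle('.', 'longdescriptions', 'ipo') right after the preprocessing script finishes would leave &amp;quot;longdescriptions_ipo.pkl&amp;quot; in the current directory.&lt;br /&gt;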
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. The &amp;quot;MMM&amp;quot; represents the model. For example, currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load a pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples it did not see during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of the examples will go into separate &amp;quot;.txt&amp;quot; files. To run the script, you basically need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension, and the name will be the same as your input. Please rename it to indicate the label information, as discussed above, and place this pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
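&lt;br /&gt;
To make the four settings concrete, here is a rough sketch of how rows might be read and split. It is an illustration under assumptions, not the code in preprocessing.py: the tab delimiter, the choice to skip rows with an unexpected number of columns, the fixed shuffle seed and the function names are all hypothetical.&lt;br /&gt;

```python
import random


def read_examples(lines, expected_columns, label_index, content_index, delimiter='\t'):
    # Keep (content, label) pairs from rows that have the expected column count.
    examples = []
    for line in lines:
        tokens = line.rstrip('\n').split(delimiter)
        if len(tokens) != expected_columns:
            continue  # assumption: malformed rows are skipped
        examples.append((tokens[content_index], tokens[label_index]))
    return examples


def split_examples(examples, train_fraction=0.8, seed=0):
    # Shuffle a copy, then cut it into a training part and a testing part.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With expected_columns = 5, label_index = 1 and content_index = 4, each kept row contributes its fifth column as the text and its second column as the label.&lt;br /&gt;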
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. Notice that it's possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
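&lt;br /&gt;
The frequency-ordered encoding described in point 1 can be sketched as follows. This is a simplified illustration rather than the code in preprocessing.py: the lowercase whitespace tokenisation, the alphabetical tie-breaking, the function names and the &amp;quot;min_index&amp;quot; parameter are assumptions made for the example.&lt;br /&gt;

```python
from collections import Counter

RESERVED_INDEX = 0  # intentionally never assigned to a word
UNKNOWN_INDEX = 1   # used for every word that is missing from the dictionary


def build_word_index(texts):
    # Count words, then give the most frequent word index 2, the next 3, and so on.
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank + 2 for rank, word in enumerate(ordered)}


def encode(text, word_index, min_index=2):
    # Map each word to its index; unknown words become UNKNOWN_INDEX.
    # Known words with an index below min_index (the most frequent ones) are dropped.
    dropped = range(2, min_index)
    indices = [word_index.get(word, UNKNOWN_INDEX) for word in text.lower().split()]
    return [index for index in indices if index not in dropped]
```

For example, after building the dictionary from your training texts, encode(text, word_index, min_index=21) keeps only known words with indices greater than 20, plus the unknown marker 1.&lt;br /&gt;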
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
All the classifiers are written in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that lets you quickly build and train a neural network. ( '''Suggestion:''' if you are new to Deep Learning and Tensorflow, please do stay with Keras. )&lt;br /&gt;
&lt;br /&gt;
* '''Embedding'''&lt;br /&gt;
 [https://keras.io/layers/embeddings/ Keras Official Documentation]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/word2vec Tensorflow : Vector Representations of Words]&lt;br /&gt;
 [https://en.wikipedia.org/wiki/Word2vec Wiki : Word2vec]&lt;br /&gt;
&lt;br /&gt;
* '''LSTM'''&lt;br /&gt;
 [http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A Nice Blog about LSTM]&lt;br /&gt;
 [https://www.tensorflow.org/tutorials/recurrent Tensorflow : Recurrent Neural Networks]&lt;br /&gt;
 [https://keras.io/layers/recurrent/ Keras Official Documentation]&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20782</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20782"/>
		<updated>2017-10-12T20:35:43Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks - the convolutional neural networks [http://cs231n.github.io/convolutional-networks/ CNN] and the recurrent neural networks [https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]. The first one, CNN, is more suitable for image-based classification tasks. The second one, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as that of the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, the later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. The &amp;quot;MMM&amp;quot; represents the model. For example, currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load a pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples it did not see during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of the examples will go into separate &amp;quot;.txt&amp;quot; files. To run the script, you basically need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension, and the name will be the same as your input. Please rename it to indicate the label information, as discussed above, and place this pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. Notice that it's possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
All the classifiers are written in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that lets you quickly build and train a neural network. ( '''Suggestion:''' if you are new to Deep Learning and Tensorflow, please do stay with Keras. )&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20781</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20781"/>
		<updated>2017-10-12T20:34:18Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks - the convolutional neural networks [http://cs231n.github.io/convolutional-networks/ CNN] and the recurrent neural networks [https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]. The first one, CNN, is more suitable for image-based classification tasks. The second one, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as that of the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, the later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. The &amp;quot;MMM&amp;quot; represents the model. For example, currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load a pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples it did not see during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of the examples will go into separate &amp;quot;.txt&amp;quot; files. To run the script, you basically need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension, and the name will be the same as your input. Please rename it to indicate the label information, as discussed above, and place this pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. Notice that it's possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
All the classifiers are written in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that lets you quickly build and train a neural network. ('''Tip:''' If you are new to deep learning and Tensorflow, please do stay with Keras.)&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20780</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20780"/>
		<updated>2017-10-12T20:32:18Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks - the convolutional neural networks [http://cs231n.github.io/convolutional-networks/ CNN] and the recurrent neural networks [https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]. The first one, CNN, is more suitable for image-based classification tasks. The second one, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the name of the pickle file is the same as that of the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, the later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. The &amp;quot;MMM&amp;quot; represents the model. For example, currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using literally the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load a pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples it did not see during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
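&lt;br /&gt;
The 80/20 split described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in preprocessing.py: the helper name split_examples, the shuffling step, the seed and the toy rows are all hypothetical, and each row is assumed to be already split into columns.&lt;br /&gt;

```python
import random


def split_examples(rows, label_index, content_index, train_fraction=0.8, seed=0):
    """Shuffle (label, text) pairs and split them into train and test lists.

    Hypothetical helper: the shuffling and the seed are assumptions made
    for this sketch, not behavior confirmed for preprocessing.py.
    """
    pairs = [(row[label_index], row[content_index]) for row in rows]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Toy rows with 5 columns: the label is in column 1, the text in column 4.
rows = [["id%d" % i, "ipo" if i % 2 else "no_ipo", "x", "y", "description %d" % i]
        for i in range(10)]
train, test = split_examples(rows, label_index=1, content_index=4)
print(len(train), len(test))
```

In an IMDB-style layout, each pair would then be written to its own &amp;quot;.txt&amp;quot; file inside a folder named after its label.&lt;br /&gt;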
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words like &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. It is possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.&lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the data preprocessing every time you run your classifier.&lt;br /&gt;
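&lt;br /&gt;
A minimal sketch of the frequency-ordered word encoding from point 1, assuming whitespace tokenization and lowercasing. The names build_vocabulary and encode are hypothetical, and starting the real words at index 2 is an assumption that follows from indices 0 and 1 being reserved; the actual preprocessing.py may differ.&lt;br /&gt;

```python
from collections import Counter


def build_vocabulary(texts, first_index=2):
    """Map each word to an index; more frequent words get smaller indices.

    Indices 0 and 1 are left unassigned, so real words start at first_index
    (an assumption for this sketch).
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: i for i, word in enumerate(ordered, start=first_index)}


def encode(text, vocabulary, unknown_index=1):
    """Encode a text as a list of integers; unseen words map to unknown_index."""
    return [vocabulary.get(word, unknown_index) for word in text.lower().split()]


texts = ["the company builds the internet", "the startup builds software"]
vocab = build_vocabulary(texts)
print(vocab["the"])
print(encode("the unknown startup", vocab))
```

With such a dictionary, dropping the most common words amounts to keeping only the indices above a chosen threshold.&lt;br /&gt;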
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
We write all the classifiers in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that lets you quickly build and train a neural network. If you are new to deep learning and Tensorflow, please stay with Keras.&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20779</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20779"/>
		<updated>2017-10-12T20:30:58Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design some useful features, say checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so rename the pickle file each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones.&lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words like &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. It is possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.&lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the data preprocessing every time you run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
We write all the deep neural networks in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that lets you quickly build and train a neural network. If you are new to deep learning and Tensorflow, please stay with Keras.&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20778</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20778"/>
		<updated>2017-10-12T20:29:57Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design some useful features, say checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so rename the pickle file each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones.&lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words like &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. It is possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.&lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the data preprocessing every time you run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
We write the classifier in [https://www.tensorflow.org/ Tensorflow]. [https://keras.io/ Keras] is a good wrapper over the Tensorflow framework that lets you quickly build and train a neural network. If you are new to deep learning and Tensorflow, please stay with Keras.&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20777</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20777"/>
		<updated>2017-10-12T20:23:25Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Model Training/Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design some useful features, say checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so rename the pickle file each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones.&lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index. That is, you should expect words like &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore very common and uninformative words, like &amp;quot;the&amp;quot;, by considering only words with indices &amp;gt; 20, for example. It is possible to encounter words that are not in our dictionary; we always assign them to index 1. These words are safe to ignore provided that our dictionary is big enough.&lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the data preprocessing every time you run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
I use [https://keras.io/ Keras] to build the entire neural network. Keras is an easy and quick way to write [https://www.tensorflow.org/ tensorflow] code.&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20776</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20776"/>
		<updated>2017-10-12T20:10:39Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that solves this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains both of the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are more suitable for image-based classification tasks. RNNs are in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place the pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. We do this by building a dictionary that maps each word to an index. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. The dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect very common words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Note that indices 0 and 1 are intentionally not assigned to any word. One advantage of this ordering is that you can easily ignore very common, uninformative words like &amp;quot;the&amp;quot; by considering only words with indices &amp;gt; 20, for example. Words that are not in our dictionary are always assigned index 1. These words are safe to ignore provided that the dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving the preprocessed data to a pickle file is a very efficient way to retrieve it later, so you don't need to redo the data preprocessing every time you want to run your classifier.&lt;br /&gt;
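&lt;br /&gt;
The pickle round trip mentioned in item 2 might look like the sketch below. The function names and the layout of the saved object are assumptions for illustration; the actual structure written by preprocessing.py is not documented here.&lt;br /&gt;

```python
import pickle


def save_dataset(dataset, path):
    """Serialize the preprocessed dataset to a pickle file."""
    with open(path, 'wb') as handle:
        pickle.dump(dataset, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_dataset(path):
    """Load a previously saved dataset; only unpickle files you trust."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)
```

The classification script then only needs the load step, which matches the open(..., 'rb') call shown in Step 1 of the run instructions.&lt;br /&gt;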
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20775</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20775"/>
		<updated>2017-10-12T20:06:09Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that solves this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains both of the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are more suitable for image-based classification tasks. RNNs are in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place the pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
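&lt;br /&gt;
To make the four settings concrete, here is a minimal sketch of how rows could be parsed and split 80/20. It is not the code in preprocessing.py: the function name, the tab delimiter, the fixed shuffle seed, and the choice to skip rows with an unexpected number of columns are all assumptions.&lt;br /&gt;

```python
import random


def split_examples(lines, expected_columns, label_index, content_index,
                   train_fraction=0.8, delimiter='\t', seed=0):
    """Parse rows into (label, content) pairs and split them into train/test."""
    examples = []
    for line in lines:
        tokens = line.rstrip('\n').split(delimiter)
        # Assumption: rows without the expected number of columns are skipped.
        if len(tokens) != expected_columns:
            continue
        examples.append((tokens[label_index], tokens[content_index]))
    # Shuffle with a fixed seed so the split is reproducible between runs.
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]
```

With the values from the run instructions, a call would look like split_examples(lines, expected_columns=5, label_index=1, content_index=4).&lt;br /&gt;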
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. We do this by building a dictionary that maps each word to an index. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. The dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect very common words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; to have very small indices. Note that indices 0 and 1 are intentionally not assigned to any word. One advantage of this ordering is that you can easily ignore very common, uninformative words like &amp;quot;the&amp;quot; by considering only words with indices &amp;gt; 20, for example. Words that are not in our dictionary are always assigned index 1. These words are safe to ignore provided that the dictionary is big enough.  &lt;br /&gt;
&lt;br /&gt;
2. Saving the preprocessed data to a pickle file is a very efficient way to retrieve it later, so you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20774</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20774"/>
		<updated>2017-10-12T19:49:56Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that solves this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains both of the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are more suitable for image-based classification tasks. RNNs are in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
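&lt;br /&gt;
The renaming in Step 5 can also be scripted. The helper below is a hypothetical convenience, not part of preprocessing.py; it assumes the pickle file sits in the given directory and uses the &amp;quot;name_label.pkl&amp;quot; pattern from the examples above.&lt;br /&gt;

```python
import os


def rename_pickle(file_name, label_name, directory='.'):
    """Rename 'file_name.pkl' to 'file_name_label_name.pkl' and return the new path."""
    source = os.path.join(directory, file_name + '.pkl')
    target = os.path.join(directory, '{}_{}.pkl'.format(file_name, label_name))
    # Refuse to clobber an existing file so earlier results are not lost silently.
    if os.path.exists(target):
        raise FileExistsError(target)
    os.rename(source, target)
    return target
```

For example, rename_pickle('longdescriptions', 'ipo') would produce &amp;quot;longdescriptions_ipo.pkl&amp;quot; in the current directory.&lt;br /&gt;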
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place the pickle file in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. We do this by building a dictionary that maps each word to an index. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. The dictionary is ordered by word frequency: the higher the frequency, the smaller the index, i.e. you should expect to see words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Note that indices 0 and 1 are intentionally not assigned to any word. The advantage of this is that you can easily ignore those common, uninformative words by considering only words with index &amp;gt; 20, for example. Any word that is not in our dictionary is encoded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving the preprocessed data to a pickle file is a very efficient way to retrieve it later, so you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20773</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20773"/>
		<updated>2017-10-12T19:49:09Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that solves this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains both of the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are more suitable for image-based classification tasks. RNNs are in general used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
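&lt;br /&gt;
The first argument of the Dense layer in Step 2 has to match the number of distinct labels in your data. As a small sketch, assuming the labels are available as a plain Python list, that number and an integer id per label could be derived as below; the function name is hypothetical and not part of the classification scripts.&lt;br /&gt;

```python
def label_encoding(labels):
    """Return (number_of_classes, label-to-integer mapping) for a list of labels."""
    # Sort the distinct labels so the mapping is stable between runs.
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return len(classes), mapping
```

The returned count is the value to use in place of the 2 in Dense(2, activation='softmax') when a task has more than two labels.&lt;br /&gt;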
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. For example, if &amp;quot;hello&amp;quot; is the 17th word in our dictionary, then &amp;quot;hello&amp;quot; is encoded as 17. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words like &amp;quot;the, a, ...&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is encoded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
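&lt;br /&gt;
The word encoding described in point 1 could be sketched roughly as follows. This is only an illustration and not the actual code in preprocessing.py: the function names, the whitespace tokenization and the lower-casing are assumptions made for the example.&lt;br /&gt;

```python
from collections import Counter

# Indices 0 and 1 are reserved: 0 is left unused, 1 marks out-of-dictionary words.
UNKNOWN_INDEX = 1
FIRST_WORD_INDEX = 2


def build_dictionary(texts):
    """Map each word to an index; more frequent words get smaller indices.

    Hypothetical helper: splits on whitespace and lower-cases the text.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ordered, start=FIRST_WORD_INDEX)}


def encode(text, dictionary):
    """Encode a text as a list of indices; unknown words become index 1."""
    return [dictionary.get(word, UNKNOWN_INDEX) for word in text.lower().split()]


def drop_common(indices, threshold):
    """Keep only indices above the threshold (drops unknown and very common words)."""
    return [index for index in indices if index > threshold]
```

Because the reserved indices and the most frequent words all sit at the low end of the index range, a single threshold comparison is enough to filter both.&lt;br /&gt;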
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20772</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20772"/>
		<updated>2017-10-12T19:47:55Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
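&lt;br /&gt;
The renaming in Step 5 can also be scripted. The sketch below is not part of preprocessing.py; the helper name and its arguments are hypothetical, and it simply refuses to replace a pickle file that already exists.&lt;br /&gt;

```python
import os


def rename_pickle(file_name, label_name, directory="."):
    """Rename e.g. longdescriptions.pkl to longdescriptions_ipo.pkl.

    Hypothetical helper: file_name is given without the ".pkl" extension.
    """
    source = os.path.join(directory, file_name + ".pkl")
    target = os.path.join(directory, file_name + "_" + label_name + ".pkl")
    if os.path.exists(target):
        # Protect a previously generated pickle file from being overwritten.
        raise FileExistsError(target + " already exists; refusing to overwrite it")
    os.rename(source, target)
    return target
```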
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
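&lt;br /&gt;
If you are unsure what number to use in Step 2, one option is to count the distinct labels stored in the pickle file. The sketch below assumes, purely for illustration, that the pickle holds two (examples, labels) pairs, one for training and one for testing; the real layout written by preprocessing.py may differ, and the function name is hypothetical.&lt;br /&gt;

```python
import pickle


def count_labels(pickle_path):
    """Return the number of distinct labels stored in the pickle file.

    Assumed layout (not confirmed by the source):
    ((train_x, train_y), (test_x, test_y))
    """
    with open(pickle_path, "rb") as file:
        (train_x, train_y), (test_x, test_y) = pickle.load(file)
    # Labels may appear in only one of the two sets, so take the union.
    return len(set(train_y) | set(test_y))
```

The returned value is what would go in as the first argument of the final Dense layer.&lt;br /&gt;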
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
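&lt;br /&gt;
To make the four settings above concrete, here is a rough sketch of how rows could be parsed and split 80/20. It is an illustration only: the tab separator, the shuffling with a fixed seed and both function names are assumptions, not a description of what preprocessing.py actually does.&lt;br /&gt;

```python
import random


def read_examples(lines, expected_columns, content_index, label_index, separator="\t"):
    """Return (label, content) pairs, skipping rows with a wrong column count.

    The tab separator is an assumption about the input file format.
    """
    examples = []
    for line in lines:
        tokens = line.rstrip("\n").split(separator)
        if len(tokens) != expected_columns:
            continue  # malformed row, e.g. a missing field
        examples.append((tokens[label_index], tokens[content_index]))
    return examples


def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples and split them into training and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Checking the column count first is what lets rows with missing fields be skipped instead of being read with the wrong label.&lt;br /&gt;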
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices (say &amp;quot;hello&amp;quot; is the 17th word in the dictionary, so &amp;quot;hello&amp;quot; -&amp;gt; 17). Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words like &amp;quot;the, a, ...&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is encoded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
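&lt;br /&gt;
Point 2 amounts to a save-once, load-many pattern. A minimal sketch with Python's standard pickle module is shown below; the helper names are hypothetical and the dataset can be any picklable Python object.&lt;br /&gt;

```python
import pickle


def save_dataset(dataset, pickle_path):
    """Write the preprocessed dataset to disk once."""
    with open(pickle_path, "wb") as file:
        pickle.dump(dataset, file, protocol=pickle.HIGHEST_PROTOCOL)


def load_dataset(pickle_path):
    """Read the dataset back without repeating the preprocessing."""
    with open(pickle_path, "rb") as file:
        return pickle.load(file)
```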
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20771</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20771"/>
		<updated>2017-10-12T19:46:22Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. The solution is to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words like &amp;quot;the, a, ...&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is encoded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20770</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20770"/>
		<updated>2017-10-12T19:45:21Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. But it's highly likely that you will use the same text inputs to predict different things, so it's important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) into integers. That is, we need to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words like &amp;quot;the, a, ...&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is encoded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so that you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20769</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20769"/>
		<updated>2017-10-12T19:44:51Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you transform a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
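&lt;br /&gt;
The renaming in Step 5 could be scripted roughly as follows. This is only a sketch: the helper names &amp;quot;labelled_pickle_name&amp;quot; and &amp;quot;rename_pickle&amp;quot; are hypothetical and are not part of preprocessing.py.&lt;br /&gt;

```python
import os


def labelled_pickle_name(pickle_path, label):
    # Hypothetical helper: insert the label name before the '.pkl' extension,
    # e.g. 'longdescriptions.pkl' becomes 'longdescriptions_ipo.pkl'.
    root, ext = os.path.splitext(pickle_path)
    return root + '_' + label + ext


def rename_pickle(pickle_path, label):
    # Rename the freshly generated pickle so a later run cannot overwrite it.
    new_path = labelled_pickle_name(pickle_path, label)
    os.rename(pickle_path, new_path)
    return new_path
```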
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
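&lt;br /&gt;
The 80%/20% split described above can be pictured with the following sketch. It assumes a seeded random shuffle before cutting the list; the function name &amp;quot;split_examples&amp;quot; and its parameters are illustrative only, and the actual script may split the rows differently.&lt;br /&gt;

```python
import random


def split_examples(examples, train_fraction=0.8, seed=0):
    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # The first train_fraction of the rows is used for training, the rest for testing.
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```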
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to encode words (strings) as integers. That is, we need to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is coded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
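&lt;br /&gt;
The word-to-index encoding from point 1 can be sketched as below. This is an illustration under stated assumptions, not the code in preprocessing.py: the names &amp;quot;build_dictionary&amp;quot;, &amp;quot;encode&amp;quot; and &amp;quot;skip_below&amp;quot; are hypothetical, tokens are split on whitespace and lower-cased, and ties in frequency are broken alphabetically.&lt;br /&gt;

```python
from collections import Counter


def build_dictionary(texts):
    # Count whitespace-separated, lower-cased tokens across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent word first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Indices start at 2, so 0 and 1 are never assigned to a word.
    return {word: index for index, word in enumerate(ordered, start=2)}


def encode(text, dictionary, skip_below=1):
    # Words missing from the dictionary are coded as 1.
    codes = (dictionary.get(word, 1) for word in text.lower().split())
    # Keep only codes from skip_below upwards: raising skip_below drops
    # unknown words first, then the most frequent words.
    kept = range(skip_below, len(dictionary) + 2)
    return [code for code in codes if code in kept]
```

Calling &amp;quot;encode&amp;quot; with a larger &amp;quot;skip_below&amp;quot; value is one way to ignore the most common words, as described in point 1.&lt;br /&gt;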
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20768</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20768"/>
		<updated>2017-10-12T19:43:23Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are broadly two main categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:'''&lt;br /&gt;
&lt;br /&gt;
Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to convert words (strings) to integers. That is, we need to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is coded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20767</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20767"/>
		<updated>2017-10-12T19:39:55Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are broadly two main categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:''' Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' &lt;br /&gt;
&lt;br /&gt;
1. One important step in data preprocessing is to convert words (strings) to integers. That is, we need to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20. Any word that is not in our dictionary is coded with index 1, so again you can easily ignore it. &lt;br /&gt;
&lt;br /&gt;
2. Saving a pickle file is a very efficient way to store and retrieve the data, so you don't need to redo the preprocessing every time you want to run your classifier.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20766</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20766"/>
		<updated>2017-10-12T19:35:02Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are broadly two main categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). CNNs are better suited to image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that for the same text inputs we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples it did not see during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:''' Your input file (usually a single &amp;quot;.txt&amp;quot; file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you need to specify the following:&lt;br /&gt;
&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the label information, as discussed above, and place it in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''To advanced users:''' One important step in data preprocessing is to convert words (strings) to integers. That is, we need to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect to see words such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; among the 10 smallest indices: 2, 3, 4, .... Please also notice that the indices 0 and 1 are intentionally not assigned to any word. The advantage of this ordering is that you can easily ignore common, uninformative words by, for example, only considering words with index &amp;gt; 20.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20764</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20764"/>
		<updated>2017-10-12T19:19:10Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high testing accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video, ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, it is highly likely that you will use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load the pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples not seen during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
'''To general users:''' your input (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you basically need to specify the following:&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the target label, as discussed above, and make sure this pickle file is in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
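As a rough sketch of how the four settings above drive the training/testing split: the function name &amp;quot;split_examples&amp;quot;, the tab delimiter, the shuffling seed and the choice to skip rows with an unexpected number of columns are all assumptions made for this example, not a description of what preprocessing.py actually does.&lt;br /&gt;
&lt;br /&gt;
```python
import random


def split_examples(lines, expected_columns, label_index, content_index,
                   train_fraction=0.8, delimiter="\t", seed=0):
    # Hypothetical helper: turn raw rows into (content, label) pairs.
    examples = []
    for line in lines:
        tokens = line.rstrip("\n").split(delimiter)
        # Assumption: rows without the expected number of columns are skipped.
        if len(tokens) != expected_columns:
            continue
        examples.append((tokens[content_index], tokens[label_index]))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]


rows = ["id%d\tlabel%d\tx\ty\ttext %d\n" % (i, i % 2, i) for i in range(10)]
train, test = split_examples(rows, expected_columns=5,
                             label_index=1, content_index=4)
print(len(train), len(test))
```
&lt;br /&gt;
With ten well-formed rows and the default fraction, the sketch puts eight examples in the training set and two in the testing set.&lt;br /&gt;
&lt;br /&gt;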
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20763</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20763"/>
		<updated>2017-10-12T19:18:45Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video, ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, it is highly likely that you will use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load the pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples not seen during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
# '''To general users:''' your input (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you basically need to specify the following:&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
The script will generate a pickle file with a &amp;quot;.pkl&amp;quot; extension and the same name as your input. Please rename it to indicate the target label, as discussed above, and make sure this pickle file is in the same directory as your classification code, i.e. &amp;quot;classification_MMM_LLL.py&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20762</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20762"/>
		<updated>2017-10-12T19:13:39Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video, ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, it is highly likely that you will use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load the pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples not seen during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
# '''To general users:''' your input (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you basically need to specify the following:&lt;br /&gt;
  1. &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  2. &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  3. &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  4. &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20761</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20761"/>
		<updated>2017-10-12T19:12:58Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video, ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, it is highly likely that you will use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, will always load the pickle file you generated in the previous step and train the neural network. At the end, the trained neural network will predict on your test examples (the examples not seen during training) and print the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
# '''To general users:''' your input (usually a single &amp;quot;.txt&amp;quot; file that contains many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of each example will go into a separate &amp;quot;.txt&amp;quot; file. To run the script, you basically need to specify the following:&lt;br /&gt;
  &amp;quot;File Name&amp;quot; : without the &amp;quot;.txt&amp;quot; extension,&lt;br /&gt;
  &amp;quot;Expected Columns&amp;quot; : total number of columns in the input file&lt;br /&gt;
  &amp;quot;Content Index&amp;quot; : the column index of the content &lt;br /&gt;
  &amp;quot;Label Index&amp;quot; : the column index of the label&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20760</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20760"/>
		<updated>2017-10-12T18:59:59Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks based on sequential information (e.g. language, video, ...). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can understand and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, it is highly likely that you will use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; to indicate that we are predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; to indicate that we are predicting the IPO status. If you don't do this, later generated pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
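&lt;br /&gt;
For Step 2 above, the first argument of the final Dense layer should equal the number of distinct labels in your data. The following is only a rough sketch of how that number could be derived: the helper build_label_to_id and the example label values are hypothetical, and the real scripts may obtain the labels differently.&lt;br /&gt;

```python
def build_label_to_id(labels):
    """Map each distinct label to an integer id, in first-seen order."""
    label_to_id = {}
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
    return label_to_id


# Hypothetical labels for an IPO-status task.
labels = ["IPO", "No IPO", "IPO", "No IPO", "No IPO"]
label_to_id = build_label_to_id(labels)

# This is the value to pass as the first argument of the final Dense layer.
num_classes = len(label_to_id)
print(num_classes)
```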
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as in the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
# '''To general users:''' your input (usually a single &amp;quot;.txt&amp;quot; file containing many examples) will be split into a training set (80% by default) and a testing set (20% by default). The target labels you want to predict will be the sub-folder names. The description of each example will go into a separate &amp;quot;.txt&amp;quot; file, whose name can be chosen by the user. To process your own dataset, you need to specify the file name, expected columns, content index and label index.&lt;br /&gt;
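&lt;br /&gt;
The column selection and the 80/20 split described above can be sketched as follows. This is an illustration under stated assumptions, not the code in preprocessing.py: the tab delimiter, the choice to skip rows with an unexpected column count, the fixed shuffle seed and the function names parse_rows and train_test_split are all hypothetical.&lt;br /&gt;

```python
import random


def parse_rows(lines, expected_columns, label_index, content_index, delimiter="\t"):
    """Keep (label, text) pairs from rows that have the expected column count."""
    examples = []
    for line in lines:
        tokens = line.rstrip("\n").split(delimiter)
        if len(tokens) != expected_columns:
            continue  # assumption: malformed rows are skipped
        examples.append((tokens[label_index], tokens[content_index]))
    return examples


def train_test_split(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into training and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten well-formed rows with five columns each, plus one malformed row.
rows = ["id{0}\tlabel{1}\tx\ty\tdescription {0}".format(i, i % 2) for i in range(10)]
rows.append("broken\trow")

examples = parse_rows(rows, expected_columns=5, label_index=1, content_index=4)
train, test = train_test_split(examples)
print(len(examples), len(train), len(test))
```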
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20759</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20759"/>
		<updated>2017-10-12T18:59:12Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, and to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
For data preprocessing, we adopt the same standard as the [http://ai.stanford.edu/~amaas/data/sentiment/ IMDB] dataset. &lt;br /&gt;
&lt;br /&gt;
# '''To general users:''' your input (usually a single &amp;quot;.txt&amp;quot; file containing many examples) will be split into a training set (80% by default) and a testing set (20% by default). The target labels you want to predict will be the sub-folder names. The description of each example will go into a separate &amp;quot;.txt&amp;quot; file, whose name can be chosen by the user. To process your own dataset, you need to specify the file name, expected columns, content index and label index.&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20728</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20728"/>
		<updated>2017-10-10T22:31:15Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* About the Deep Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, and to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20727</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20727"/>
		<updated>2017-10-10T22:29:56Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* About the Deep Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks on sequential information (e.g. language, video). I have tried both and, as expected, the RNN is more robust across different text classification tasks.&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, and to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20726</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20726"/>
		<updated>2017-10-10T22:28:22Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* About the Deep Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, and to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is. &amp;quot;MMM&amp;quot; represents the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even though the model and the inputs can be exactly the same. This Python file, no matter what the model is, always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20725</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20725"/>
		<updated>2017-10-10T22:27:18Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* About the Deep Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks ([http://cs231n.github.io/convolutional-networks/ CNN]) and recurrent neural networks (RNN). The first, CNN, is more suitable for image-based classification tasks. The second, RNN, is in general used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, and to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is defined. &quot;MMM&quot; stands for the model; currently I have &quot;1DConvolution&quot;, &quot;2DConvolution&quot; and &quot;LSTM&quot;. &quot;LLL&quot; stands for the name of the label. Note that the same model can predict different things from the same text inputs. For example, &quot;classification_LSTM_indu.py&quot; is an LSTM model that predicts the industry from the descriptions, and &quot;classification_LSTM_ipo.py&quot; is an LSTM model that predicts the IPO status from the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even when the model and the inputs are exactly the same. Whatever the model is, this Python file always loads a pickle file generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
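&lt;br /&gt;
To check that a pickle file loads before starting a long training run, the open call from Step 1 can be tried on its own. This is a sketch under assumptions: the layout written by preprocessing.py is not described on this page, so a small (sequences, labels) tuple stands in for it, and the file is created in a temporary directory.&lt;br /&gt;
&lt;br /&gt;
```python
import os
import pickle
import tempfile

# Hypothetical payload: the real layout written by preprocessing.py is not
# documented here, so a small (sequences, labels) tuple stands in for it.
payload = ([[3, 7, 1], [4, 2]], [0, 1])

pickle_path = os.path.join(tempfile.mkdtemp(), "longdescription_ipo.pkl")
with open(pickle_path, "wb") as handle:
    pickle.dump(payload, handle)

# Same open(..., 'rb') pattern as in Step 1 above.
with open(pickle_path, "rb") as handle:
    sequences, labels = pickle.load(handle)

print(len(sequences), labels)
```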
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20724</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20724"/>
		<updated>2017-10-10T22:12:04Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, say checking whether the text contains the words &quot;Internet&quot; and &quot;High-tech&quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high test accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are more suitable for image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &quot;XXX.txt&quot; input file into a numeric pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &quot;.txt&quot; file. You will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &quot;longdescriptions.pkl&quot; to &quot;longdescriptions_indu.pkl&quot; when predicting industry areas, and to &quot;longdescriptions_ipo.pkl&quot; when predicting IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
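&lt;br /&gt;
To make the role of &quot;expected_columns&quot;, &quot;label_index&quot; and &quot;content_index&quot; concrete, here is a minimal sketch of how rows might be filtered and split. It is not the code in preprocessing.py: the function &quot;extract_examples&quot;, the tab delimiter, the sample rows, and the choice to skip rows with the wrong column count are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def extract_examples(lines, expected_columns, label_index, content_index, delimiter="\t"):
    """Return (label, text) pairs from rows that have the expected column count."""
    examples = []
    for line in lines:
        tokens = line.rstrip("\n").split(delimiter)
        if len(tokens) != expected_columns:
            # Skip malformed rows, e.g. rows with missing fields.
            continue
        examples.append((tokens[label_index], tokens[content_index]))
    return examples


rows = [
    "1\tipo\t2001\tTX\tOnline retailer of books\n",
    "2\tprivate\t1999\tCA\n",  # only four columns, so it is skipped
    "3\tprivate\t2005\tNY\tBiotech research firm\n",
]
examples = extract_examples(rows, expected_columns=5, label_index=1, content_index=4)
print(examples)
```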
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is defined. &quot;MMM&quot; stands for the model; currently I have &quot;1DConvolution&quot;, &quot;2DConvolution&quot; and &quot;LSTM&quot;. &quot;LLL&quot; stands for the name of the label. Note that the same model can predict different things from the same text inputs. For example, &quot;classification_LSTM_indu.py&quot; is an LSTM model that predicts the industry from the descriptions, and &quot;classification_LSTM_ipo.py&quot; is an LSTM model that predicts the IPO status from the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even when the model and the inputs are exactly the same. Whatever the model is, this Python file always loads a pickle file generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
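&lt;br /&gt;
The number passed to Dense in Step 2 should equal the number of distinct labels in the data. One way to count them, and to turn them into integer ids, is sketched below; &quot;build_label_index&quot; and the sample labels are hypothetical and not taken from the project code.&lt;br /&gt;
&lt;br /&gt;
```python
def build_label_index(labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


raw_labels = ["ipo", "private", "private", "ipo", "private"]
label_to_id = build_label_index(raw_labels)
num_labels = len(label_to_id)  # candidate value for the first argument of Dense
encoded = [label_to_id[label] for label in raw_labels]
print(num_labels, encoded)
```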
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20723</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20723"/>
		<updated>2017-10-10T21:48:47Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, say checking whether the text contains the words &quot;Internet&quot; and &quot;High-tech&quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high test accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are more suitable for image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &quot;XXX.txt&quot; input file into a numeric pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &quot;.txt&quot; file. You will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &quot;longdescriptions.pkl&quot; to &quot;longdescriptions_indu.pkl&quot; when predicting industry areas, and to &quot;longdescriptions_ipo.pkl&quot; when predicting IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
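&lt;br /&gt;
The phrase &quot;numeric pickle file&quot; can be illustrated with a common word-to-integer encoding. This is a sketch of one such scheme, assumed for illustration only: the page does not say how preprocessing.py encodes words, and the helpers &quot;build_vocabulary&quot; and &quot;encode&quot; are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def build_vocabulary(texts):
    """Assign ids starting at 1 to words, most frequent first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def encode(text, vocabulary):
    """Replace each known word with its id; unknown words are dropped."""
    return [vocabulary[word] for word in text.lower().split() if word in vocabulary]


texts = ["online book retailer", "online biotech research"]
vocabulary = build_vocabulary(texts)
encoded_texts = [encode(text, vocabulary) for text in texts]
print(encoded_texts)
```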
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is defined. &quot;MMM&quot; stands for the model; currently I have &quot;1DConvolution&quot;, &quot;2DConvolution&quot; and &quot;LSTM&quot;. &quot;LLL&quot; stands for the name of the label. Note that the same model can predict different things from the same text inputs. For example, &quot;classification_LSTM_indu.py&quot; is an LSTM model that predicts the industry from the descriptions, and &quot;classification_LSTM_ipo.py&quot; is an LSTM model that predicts the IPO status from the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even when the model and the inputs are exactly the same. Whatever the model is, this Python file always loads a pickle file generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
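&lt;br /&gt;
The accuracy printed at the end is measured on examples held out from training. The sketch below shows that idea in plain Python. The helpers &quot;split_train_test&quot; and &quot;accuracy&quot; are hypothetical, and the way the real scripts split the data is not stated on this page.&lt;br /&gt;
&lt;br /&gt;
```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def split_train_test(examples, test_fraction):
    """Hold out the last test_fraction of the examples for testing."""
    test_size = int(len(examples) * test_fraction)
    cut = len(examples) - test_size
    return examples[:cut], examples[cut:]


examples = list(range(10))
train, test = split_train_test(examples, 0.2)
print(len(train), len(test))
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```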
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==General Guidelines for Tuning the Hyper-Parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20722</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20722"/>
		<updated>2017-10-10T21:48:04Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, say checking whether the text contains the words &quot;Internet&quot; and &quot;High-tech&quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high test accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are more suitable for image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &quot;XXX.txt&quot; input file into a numeric pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &quot;.txt&quot; file. You will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &quot;longdescriptions.pkl&quot; to &quot;longdescriptions_indu.pkl&quot; when predicting industry areas, and to &quot;longdescriptions_ipo.pkl&quot; when predicting IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is defined. &quot;MMM&quot; stands for the model; currently I have &quot;1DConvolution&quot;, &quot;2DConvolution&quot; and &quot;LSTM&quot;. &quot;LLL&quot; stands for the name of the label. Note that the same model can predict different things from the same text inputs. For example, &quot;classification_LSTM_indu.py&quot; is an LSTM model that predicts the industry from the descriptions, and &quot;classification_LSTM_ipo.py&quot; is an LSTM model that predicts the IPO status from the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even when the model and the inputs are exactly the same. Whatever the model is, this Python file always loads a pickle file generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
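&lt;br /&gt;
The softmax activation in Step 2 turns the final layer's raw scores into one probability per label. The plain-Python sketch below shows that computation for two labels. The function &quot;softmax&quot; and the example scores are made up for illustration; the real model relies on the Keras layer.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


probabilities = softmax([2.0, 0.5])
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
```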
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==Model Training/Prediction==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20721</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20721"/>
		<updated>2017-10-10T21:43:16Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, say checking whether the text contains the words &quot;Internet&quot; and &quot;High-tech&quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high test accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are more suitable for image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &quot;XXX.txt&quot; input file into a numeric pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &quot;.txt&quot; file. You will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &quot;longdescriptions.pkl&quot; to &quot;longdescriptions_indu.pkl&quot; when predicting industry areas, and to &quot;longdescriptions_ipo.pkl&quot; when predicting IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is defined. &quot;MMM&quot; stands for the model; currently I have &quot;1DConvolution&quot;, &quot;2DConvolution&quot; and &quot;LSTM&quot;. &quot;LLL&quot; stands for the name of the label. Note that the same model can predict different things from the same text inputs. For example, &quot;classification_LSTM_indu.py&quot; is an LSTM model that predicts the industry from the descriptions, and &quot;classification_LSTM_ipo.py&quot; is an LSTM model that predicts the IPO status from the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even when the model and the inputs are exactly the same. Whatever the model is, this Python file always loads a pickle file generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : specify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20720</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20720"/>
		<updated>2017-10-10T21:42:16Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, say checking whether the text contains the words &quot;Internet&quot; and &quot;High-tech&quot; at the same time, and to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve very high test accuracy. However, the features used by the deep neural network are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are basically two big categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are more suitable for image-based classification tasks. RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* TensorFlow https://www.tensorflow.org/&lt;br /&gt;
* NumPy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &quot;XXX.txt&quot; input file into a numeric pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &quot;.txt&quot; file. You will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &quot;longdescriptions.pkl&quot; to &quot;longdescriptions_indu.pkl&quot; when predicting industry areas, and to &quot;longdescriptions_ipo.pkl&quot; when predicting IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network is defined. &quot;MMM&quot; stands for the model; currently I have &quot;1DConvolution&quot;, &quot;2DConvolution&quot; and &quot;LSTM&quot;. &quot;LLL&quot; stands for the name of the label. Note that the same model can predict different things from the same text inputs. For example, &quot;classification_LSTM_indu.py&quot; is an LSTM model that predicts the industry from the descriptions, and &quot;classification_LSTM_ipo.py&quot; is an LSTM model that predicts the IPO status from the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations even when the model and the inputs are exactly the same. Whatever the model is, this Python file always loads a pickle file generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (examples not seen during training) and prints the accuracy.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the name of the pickle file&lt;br /&gt;
&lt;br /&gt;
 with open('longdescription_ipo.pkl', 'rb') as file:&lt;br /&gt;
&lt;br /&gt;
* Step 2 : modify the total number of possible labels&lt;br /&gt;
&lt;br /&gt;
 model.add(Dense(2, activation='softmax'))&lt;br /&gt;
&lt;br /&gt;
* Step 3 : run the code&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM_ipo.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20719</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20719"/>
		<updated>2017-10-10T21:32:21Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two broad categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are better suited to image-based classification tasks, while RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependences==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 Attention: by default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name each time you run the above script. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
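The settings from Steps 2 and 3 can be pictured as follows. This is only a sketch, not the code in preprocessing.py: it assumes the input file is tab-delimited, &amp;quot;extract_label_and_text&amp;quot; is a hypothetical helper, and the example row is made up.&lt;br /&gt;

```python
def extract_label_and_text(line, expected_columns=5, label_index=1,
                           content_index=4, delimiter='\t'):
    """Split one line into tokens and return (label, text), or None for a malformed row."""
    tokens = line.rstrip('\n').split(delimiter)
    if len(tokens) != expected_columns:
        # skip rows with missing or extra columns
        return None
    return tokens[label_index], tokens[content_index]


# made-up row with five tab-separated columns (assumption)
row = 'id7\tsoftware\t2015\tHouston\tWe build cloud tools for small teams.'
print(extract_label_and_text(row))
print(extract_label_and_text('id8\tsoftware'))
```

The first call returns the label and the text; the second row has too few columns and is skipped.&lt;br /&gt;
&lt;br /&gt;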
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that the same model can predict different things from the same text inputs. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry from the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status from the same descriptions. You need to name your files properly! Whatever the model is, this Python file always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (the examples not seen during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20718</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20718"/>
		<updated>2017-10-10T21:30:05Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two broad categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are better suited to image-based classification tasks, while RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependences==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
 By default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name. For example, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot; when predicting the industry areas, or to &amp;quot;longdescriptions_ipo.pkl&amp;quot; when predicting the IPO status. If you don't do this, later pickle files will overwrite the previously generated ones. &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that the same model can predict different things from the same text inputs. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry from the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status from the same descriptions. You need to name your files properly! Whatever the model is, this Python file always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (the examples not seen during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20717</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20717"/>
		<updated>2017-10-10T21:26:19Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two broad categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are better suited to image-based classification tasks, while RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependences==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
* Step 5 : give your pickle file a more reasonable name&lt;br /&gt;
&lt;br /&gt;
By default, the pickle file has the same name as the original &amp;quot;.txt&amp;quot; file. However, you will very likely use the same text inputs to predict different things, so give the pickle file a more descriptive name. For example, if the input file is &amp;quot;longdescriptions.txt&amp;quot;, rename &amp;quot;longdescriptions.pkl&amp;quot; to &amp;quot;longdescriptions_indu.pkl&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that the same model can predict different things from the same text inputs. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry from the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status from the same descriptions. You need to name your files properly! Whatever the model is, this Python file always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (the examples not seen during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20716</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20716"/>
		<updated>2017-10-10T21:18:58Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two broad categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are better suited to image-based classification tasks, while RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependences==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that the same model can predict different things from the same text inputs. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry from the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status from the same descriptions. You need to name your files properly! Whatever the model is, this Python file always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (the examples not seen during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing step usually only needs to be run once. The saved pickle file is a machine-friendly serialized format that loads very quickly.&lt;br /&gt;
&lt;br /&gt;
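The save-once, load-many pattern can be sketched with the standard pickle module. The dictionary below is a made-up stand-in for the preprocessed data; the structure actually written by preprocessing.py may differ.&lt;br /&gt;

```python
import os
import pickle
import tempfile

# made-up stand-in for the preprocessed data; the real structure may differ
data = {'texts': [[4, 17, 9], [3, 8]], 'labels': [0, 1]}

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 'longdescriptions_ipo.pkl')
    # preprocessing side: write the numerical data once
    with open(path, 'wb') as file:
        pickle.dump(data, file)
    # training side: reload it at the start of every run
    with open(path, 'rb') as file:
        loaded = pickle.load(file)

print(loaded == data)  # True
```

&lt;br /&gt;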
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20715</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20715"/>
		<updated>2017-10-10T21:16:54Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two broad categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are better suited to image-based classification tasks, while RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependences==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that the same model can predict different things from the same text inputs. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry from the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status from the same descriptions. You need to name your file properly. Whatever the model is, this Python file always loads a pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (the examples not seen during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing step usually only needs to be run once. The saved pickle file is a machine-friendly serialized format that loads very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20714</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20714"/>
		<updated>2017-10-10T21:13:10Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for the text input. For example, we may want to classify a company's industry area based on its description. Or we may want to classify a company's IPO status based on its description. &lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network to uniformly solve this problem. The traditional way of doing this is to hire a task specific expert to manually design some useful features, say to check if the text contains words &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot; at the same time, and to classify based on the observed features. Our way, by using the deep neural network, can automatically extract the features and most importantly achieve very high testing accuracy. However, the features that are used by the deep neural network are not human interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
There are two broad categories of deep neural networks: convolutional neural networks (CNN) and recurrent neural networks (RNN). CNNs are better suited to image-based classification tasks, while RNNs are generally used for classification tasks on sequential information (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependences==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction. &lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; stands for the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; stands for the name of the label. Notice that the same model can predict different things from the same text inputs. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry from the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status from the same descriptions. Whatever the model is, this Python file always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your test examples (the examples not seen during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing step usually only needs to be run once. The saved pickle file is a machine-friendly serialized format that loads very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20711</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20711"/>
		<updated>2017-10-10T21:11:18Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
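&lt;br /&gt;
To make the settings in the steps above concrete, here is a rough sketch of what a preprocessing pass of this kind might look like. It is not the actual preprocessing.py: the tab separator, the choice to skip rows with an unexpected column count, the whitespace tokenizer and all function names are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def parse_rows(lines, expected_columns, label_index, content_index, sep="\t"):
    """Split each line into columns and keep (label, text) pairs."""
    pairs = []
    for line in lines:
        tokens = line.rstrip("\n").split(sep)
        # Assumption: rows without the expected number of columns are skipped.
        if len(tokens) != expected_columns:
            continue
        pairs.append((tokens[label_index], tokens[content_index]))
    return pairs


def build_vocabulary(texts):
    """Map each word to a positive integer, most frequent first (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(pairs):
    """Turn (label, text) pairs into integer sequences and integer class ids."""
    labels = sorted({label for label, _ in pairs})
    label_ids = {label: i for i, label in enumerate(labels)}
    vocab = build_vocabulary(text for _, text in pairs)
    x = [[vocab[word] for word in text.lower().split()] for _, text in pairs]
    y = [label_ids[label] for label, _ in pairs]
    return x, y, vocab, label_ids


if __name__ == "__main__":
    # Toy rows with 5 columns; label in column 1 and text in column 4, as in Step 3.
    rows = [
        "1\tTech\ta\tb\tinternet software company\n",
        "2\tBio\ta\tb\tdrug company\n",
        "a malformed row\n",
    ]
    pairs = parse_rows(rows, expected_columns=5, label_index=1, content_index=4)
    x, y, vocab, label_ids = encode(pairs)
    print(x, y)
```
&lt;br /&gt;
The resulting integer sequences and class ids are the kind of numerical data that would then be written to the pickle file.&lt;br /&gt;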
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; is the name of the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; is the name of the label. Notice that the same model architecture can predict different labels from the same text input. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Whatever the model, this Python file always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples (examples it did not see during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20710</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20710"/>
		<updated>2017-10-10T21:10:43Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
'''Data Preprocessing (preprocessing.py)''' : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
'''Model Training/Prediction (classification_MMM_LLL.py)''' : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; is the name of the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; is the name of the label. Notice that the same model architecture can predict different labels from the same text input. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Whatever the model, this Python file always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples (examples it did not see during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20709</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20709"/>
		<updated>2017-10-10T21:09:42Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
Data Preprocessing (preprocessing.py) : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
* Step 4 : run the code&lt;br /&gt;
&lt;br /&gt;
   python preprocessing.py &lt;br /&gt;
&lt;br /&gt;
Model Training/Prediction (classification_MMM_LLL.py) : this is where the deep neural network lives. &amp;quot;MMM&amp;quot; is the name of the model; currently I have &amp;quot;1DConvolution&amp;quot;, &amp;quot;2DConvolution&amp;quot; and &amp;quot;LSTM&amp;quot;. &amp;quot;LLL&amp;quot; is the name of the label. Notice that the same model architecture can predict different labels from the same text input. For example, &amp;quot;classification_LSTM_indu.py&amp;quot; is an LSTM model that predicts the industry based on the descriptions, and &amp;quot;classification_LSTM_ipo.py&amp;quot; is an LSTM model that predicts the IPO status based on the same descriptions. Whatever the model, this Python file always loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples (examples it did not see during training) and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20708</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20708"/>
		<updated>2017-10-10T20:56:28Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
Data Preprocessing (preprocessing.py) : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
* Step 3 : specify the indices of the text and the label in &amp;quot;prepare_imdb_structure(file_name, expected_columns)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # the index of the label in the tokens&lt;br /&gt;
   label_index = 1&lt;br /&gt;
   # the index of the text in the tokens&lt;br /&gt;
   content_index = 4&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The second part of the code is where the deep neural network lives. It loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20707</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20707"/>
		<updated>2017-10-10T20:52:23Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
Data Preprocessing (preprocessing.py) : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in &amp;quot;main()&amp;quot;&lt;br /&gt;
&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
&lt;br /&gt;
* Step 2 : specify the expected columns of your target file&lt;br /&gt;
&lt;br /&gt;
   # expected number of columns, in case we have &amp;quot;None&amp;quot; in the table&lt;br /&gt;
   expected_columns = 5&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The second part of the code is where the deep neural network lives. It loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20706</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20706"/>
		<updated>2017-10-10T20:50:39Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
* Data Preprocessing (preprocessing.py) : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
* Step 1 : modify the target file name in main()&lt;br /&gt;
   # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
* Step 2 : specify the expected columns &lt;br /&gt;
&lt;br /&gt;
The second part of the code is where the deep neural network lives. It loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20705</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20705"/>
		<updated>2017-10-10T20:49:32Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
The code contains two parts: Data Preprocessing and Model Training/Prediction.&lt;br /&gt;
&lt;br /&gt;
* Data Preprocessing (preprocessing.py) : this is where you convert a text-based &amp;quot;XXX.txt&amp;quot; input file into a numerical pickle file that the later part of the code can load and use for training and prediction.&lt;br /&gt;
&lt;br /&gt;
 # modify the target file name in main()&lt;br /&gt;
    # don't add &amp;quot;.txt&amp;quot; extension&lt;br /&gt;
   file_name = 'ThicketDefCodingTestProcessed'&lt;br /&gt;
 # specify the expected columns &lt;br /&gt;
&lt;br /&gt;
The second part of the code is where the deep neural network lives. It loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20704</id>
		<title>Deep Text Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Deep_Text_Classifier&amp;diff=20704"/>
		<updated>2017-10-10T20:34:02Z</updated>

		<summary type="html">&lt;p&gt;Yangzhang: /* How to Run the Code */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Deep Text Classifier=&lt;br /&gt;
&lt;br /&gt;
==Problem Description==&lt;br /&gt;
&lt;br /&gt;
We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.&lt;br /&gt;
&lt;br /&gt;
==General Approach==&lt;br /&gt;
&lt;br /&gt;
We will build a deep neural network that handles all of these tasks in a uniform way. The traditional approach is to have a task-specific expert manually design useful features, for example checking whether the text contains both &amp;quot;Internet&amp;quot; and &amp;quot;High-tech&amp;quot;, and then to classify based on the observed features. A deep neural network instead extracts features automatically and, most importantly, can achieve high accuracy on held-out testing examples. However, the features it learns are not human-interpretable.&lt;br /&gt;
&lt;br /&gt;
==About the Deep Models==&lt;br /&gt;
&lt;br /&gt;
Broadly, there are two main categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are most commonly used for image-based classification tasks. RNNs are generally used for classification tasks on sequential data (e.g. language, video).&lt;br /&gt;
&lt;br /&gt;
==Major Package Dependencies==&lt;br /&gt;
&lt;br /&gt;
* Tensorflow https://www.tensorflow.org/&lt;br /&gt;
* Numpy http://www.numpy.org/&lt;br /&gt;
* Keras https://keras.io/&lt;br /&gt;
&lt;br /&gt;
==How to Run the Code==&lt;br /&gt;
&lt;br /&gt;
My code has been intentionally broken into two parts: data preprocessing and model training/prediction&lt;br /&gt;
&lt;br /&gt;
The first part of the code handles data preprocessing, which I will discuss later. This is where you transform your single &amp;quot;XXX.txt&amp;quot; input file into a pickle file that the later part of the code can use for training and prediction. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python preprocessing.py&lt;br /&gt;
&lt;br /&gt;
The second part of the code is where the deep neural network lives. It loads the pickle file you generated in the previous step and trains the neural network. At the end, the trained network predicts on your testing examples and prints the accuracy. To run this part:&lt;br /&gt;
&lt;br /&gt;
 python classification_LSTM.py&lt;br /&gt;
&lt;br /&gt;
Notice that the data preprocessing part usually only needs to be done once. The saved pickle file is a serialized, machine-friendly copy of the processed data that can be loaded very quickly.&lt;br /&gt;
&lt;br /&gt;
==Data Preprocessing==&lt;br /&gt;
&lt;br /&gt;
==How to Modify the Code to Solve your own problems==&lt;br /&gt;
&lt;br /&gt;
==General guidelines for tuning the hyper-parameters==&lt;/div&gt;</summary>
		<author><name>Yangzhang</name></author>
		
	</entry>
</feed>