Difference between revisions of "Industry Classifier"
Latest revision as of 13:47, 21 September 2020
The objective of this project is to build a neural network that can classify a firm's industry based on text of its business description.
| Industry Classifier | |
|---|---|
| Project Information | |
| Has title | Industry Classifier |
| Has owner | Christy Warden |
| Has start date | Spring 2017 |
| Has deadline date | |
| Has keywords | Tool |
| Has project status | Subsume |
| Dependent(s): | Accelerator Seed List (Data), U.S. Seed Accelerators |
| Subsumed by: | Deep Text Classifier |
| Has sponsor | McNair Center |
| Has project output | Tool |
The following projects are dependent on this project: Accelerator Seed List (Data), U.S. Seed Accelerators.
Summer 2018 Work
Test data comes from Crunchbase. The database is called crunchbase2 and is located in:
/bulk/crunchbase2
The pulled information is in:
E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\Our companies with other info.xlsx
The code to build tables to pull all info is in:
E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\BuildTestData.sql
MLP Classifier
The new version that I am currently editing is:
E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\IndustryClassifierCONDENSED-USETHIS.py
The small training and testing data files are called:
2018traindata.txt
NewTestData2018.txt
The larger training and testing data files are called:
bigtrain2018.txt
bigtest2018.txt
This script modifies the Classifier.pkl file, which stores the components of the model. Eventually, we should be able to run this through FinalIndustryClassifier.py.
The Crunchbase data in my training set has almost 40 labels, and I could not get the accuracy of this model above 30%. However, if you assign only 3 labels, the accuracy goes up to 50%.
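A minimal sketch of this kind of classifier is given below, assuming scikit-learn is installed. It is not the code in IndustryClassifierCONDENSED-USETHIS.py. The toy training lines, the Description [tab] Label layout, the TF-IDF features, and the layer size are all assumptions chosen for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for a training file with lines of "Description<TAB>Label".
train_lines = [
    "Develops antibody therapies for cancer\tbiotech",
    "Clinical stage gene therapy company\tbiotech",
    "Streams music and podcasts to subscribers\tmedia",
    "Publishes online video news\tmedia",
]
descriptions = [line.split("\t")[0] for line in train_lines]
labels = [line.split("\t")[1] for line in train_lines]

# Turn text into TF-IDF features, then train a small multi-layer perceptron.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
model.fit(descriptions, labels)

new_descriptions = ["Gene therapy for rare diseases", "Online video streaming"]
predictions = model.predict(new_descriptions)

# Pickling the whole pipeline keeps the vectorizer and the network together.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

With roughly 40 labels, guessing uniformly at random would be right about 2.5% of the time, so the accuracy figures above are better read against that baseline than against 50%.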
LSTM Model
See the old page, Deep Text Classifier. I updated the preprocessing file to run on Python 3.
I tried updating this code to run on the new data from Crunchbase. Files used are located in:
E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\Yang's Code
Run the preprocessing file first, then the classification file. I could not figure out why the accuracy of this model was only 10% with 40 labels and around 30% with 5-8 labels. Its accuracy should be higher than that of the MLP classifier.
New Notes
We're rebuilding the Industry Classifier using better technology and better inputs.
For the inputs:
- Run LoadLongDescription.sql in Z:\VentureCapitalData\SDCVCData\vcdb2
- With sdccompanybase1 table already loaded, load the commented code in that file too
- This outputs longdescriptionindu.txt
Final Product and Use
Description
The final product (as of 2/27/17) is FinalIndustryClassifier.py, which is located in McNair/Projects/Accelerators/Industry_Classifier. It takes an input file of the format Company [tab] Description and outputs a file named after the input file, with Classified.txt in place of the extension. (So if you input Myfile.txt, your output file will be MyfileClassified.txt.) This file will be located in the same folder as the FinalIndustryClassifier.py code (McNair/Projects/Accelerators/Industry_Classifier).
Use
1) Create a file of the format Company [tab] Description. The description must all be on one line.
2) Copy your file into the folder McNair/Projects/Accelerators/Industry_Classifier
3) Open the file FinalIndustryClassifier.py in Komodo
4) On line 7 of the code, change the words inside the quotation marks to the name of your file. For example, if your file is called MyFile.txt, line 7 should read myfile = "MyFile.txt"
5) Press the play button and wait for "Done!" to print in the output window of Komodo.
6) Open McNair/Projects/Accelerators/Industry_Classifier and find the file called "(the name of your file)Classified.txt" (e.g. MyFileClassified.txt)
7) Open this file (IN TEXTPAD). It should be your output of the format Company [tab] Classification.
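The input and output formats used in the steps above can be sketched as follows. This is an illustration of the file layout only, not the code in FinalIndustryClassifier.py. The function names are hypothetical, and the classify argument stands in for whatever model is loaded.

```python
import os


def classified_name(input_name):
    """Return the output filename: MyFile.txt becomes MyFileClassified.txt."""
    stem, _extension = os.path.splitext(input_name)
    return stem + "Classified.txt"


def read_companies(lines):
    """Parse lines of 'Company<TAB>Description' into (company, description) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        company, description = line.split("\t", 1)
        rows.append((company, description))
    return rows


def format_output(rows, classify):
    """Build output lines of 'Company<TAB>Classification'."""
    return ["%s\t%s" % (company, classify(description)) for company, description in rows]


rows = read_companies(["Acme\tMakes anvils\n", "\n"])
output_lines = format_output(rows, lambda description: "manufacturing")
```

Because each record is split on the first tab only, a description that spills onto a second line would be misread, which is why the description must all be on one line.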
Command Line Use
A command line program exists for this tool. To use it, open the Command Prompt and change directories to:
E:\McNair\Projects\Accelerators\Industry_Classifier
To run the program, enter:
python FinalIndustryClassifier_command.py
A prompt will appear asking you to enter an F or S. F stands for File Input, and S stands for Single Use. If you select F, a prompt will appear asking you to enter an input filename, and an output filename, separated by a space.
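The prompt handling described above might look roughly like the sketch below. It is not taken from FinalIndustryClassifier_command.py. The function names are hypothetical, and accepting lowercase answers is an assumption.

```python
def parse_mode(answer):
    """Interpret the first prompt: F for File Input, S for Single Use."""
    mode = answer.strip().upper()
    if mode not in ("F", "S"):
        raise ValueError("expected F or S, got %r" % answer)
    return mode


def parse_filenames(answer):
    """Interpret the second prompt: input and output filenames separated by a space."""
    parts = answer.split()
    if len(parts) != 2:
        raise ValueError("expected two filenames separated by a space")
    return parts[0], parts[1]


mode = parse_mode(" f ")
input_name, output_name = parse_filenames("MyFile.txt MyFileClassified.txt")
```

Since the two filenames are separated by a space, filenames that themselves contain spaces cannot be entered this way.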
Possible Tools
Python Tools
SciKit Learn SVM
http://scikit-learn.org/stable/modules/svm.html#svm
Its complexity is between O(n^2) and O(n^3). Seems easy to use. This is not a neural net; it is a support vector machine.
SciKit Learn Neural Net
http://scikit-learn.org/stable/modules/neural_networks_supervised.html
This IS a neural net using back propagation.
Its complexity is listed as follows. Suppose there are n training samples, m features, k hidden layers, each containing h neurons (for simplicity), and o output neurons. The time complexity of backpropagation is O(n * m * h^k * o * i), where i is the number of iterations. Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and few hidden layers for training.
WE ENDED UP USING THIS ONE
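To get a feel for the complexity estimate quoted for the SciKit Learn neural net, the expression can be evaluated directly. The sizes below are made-up numbers for illustration, not measurements from this project.

```python
def backprop_cost(n, m, h, k, o, i):
    """Evaluate n * m * h^k * o * i, the quoted backpropagation cost estimate."""
    return n * m * h ** k * o * i


# Hypothetical sizes: 1000 samples, 500 features, 40 output labels, 200 iterations.
one_layer = backprop_cost(n=1000, m=500, h=10, k=1, o=40, i=200)
two_layers = backprop_cost(n=1000, m=500, h=10, k=2, o=40, i=200)
```

Under this estimate, each extra hidden layer multiplies the cost by h, which is why starting with few layers and few neurons is the cheaper option.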
SK Neural Network Package
This is a separate package from the one listed above. It requires a separate installation. Documentation is provided at:
https://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
We ran into deprecation warnings, and the program would not execute due to a missing g++ drive.
R Tools
R seems to have a built-in package called "neuralnet".
An example is given at:
https://www.packtpub.com/books/content/training-and-visualizing-neural-network-r
Scripts
Scripts and data for this project are located in:
E:\McNair\Projects\Accelerators\Code+Final_Data\ChristyCode
Industry Classifier
This is a neural net built in Python that trains on industry designation data from the SDC Platinum database. It serves as a model to predict the industry allocation of given companies. The file is located in the directory listed above.
FindTrainData.py
Builds a tab-delimited text file containing 200 companies for each industry classification (e.g. 200 biotech, 200 media, etc.). Hopefully, if we use this as our training data, we will get more accurate classifications.
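The idea of drawing the same number of companies from each industry can be sketched as below. This is not the code in FindTrainData.py. The (company, description, industry) row layout, the function name, and the use of random sampling are assumptions.

```python
import random
from collections import defaultdict


def sample_per_industry(rows, per_class=200, seed=0):
    """Pick up to per_class rows for each industry from (company, description, industry) rows."""
    by_industry = defaultdict(list)
    for row in rows:
        by_industry[row[2]].append(row)

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = []
    for industry in sorted(by_industry):
        group = by_industry[industry]
        # Industries with fewer than per_class companies contribute all they have.
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


rows = [("Bio%d" % i, "Makes drugs", "biotech") for i in range(5)]
rows += [("Med%d" % i, "Makes films", "media") for i in range(2)]
balanced = sample_per_industry(rows, per_class=3)
```

Balancing the classes this way keeps a very common industry from dominating the training data.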
FixDescriptions.py
Deals with the problem that output files from SDC are poorly formatted when the description goes beyond 1 line. Outputs a tab-delimited text file where the whole description is on the same line and can be read.
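One way such a repair could work is sketched below. It is not the code in FixDescriptions.py, and it rests on an assumption about the SDC output that the source does not confirm: that a line starting a new record contains a tab, while a wrapped continuation line does not.

```python
def join_wrapped_descriptions(lines):
    """Merge continuation lines into the preceding 'Company<TAB>Description' record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if "\t" in line or not records:
            # Assumed rule: a tab marks the start of a new record.
            records.append(line.strip())
        else:
            # No tab: treat the line as the rest of the previous description.
            records[-1] = records[-1] + " " + line.strip()
    return records


fixed = join_wrapped_descriptions([
    "Acme\tMakes anvils and\n",
    "    rockets for coyotes\n",
    "Beta\tWrites software\n",
])
```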
Addresses.txt
This text file contains investment info, name, address, city, state of Portfolio companies.
Descriptions.txt
This text file contains company, short description, major industry, minor industry of Portfolio companies.
Statistics
Statistical methods for analyzing results from a neural network: Precision and Recall (https://en.wikipedia.org/wiki/Precision_and_recall).
Quick check using Excel: find the number of correct matches between two columns:
=SUMPRODUCT(--(range1=range2))
See an example at https://exceljet.net/formula/count-matches-between-two-columns.
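The same quick check can be done in Python, along with precision and recall for one label. This is a small sketch on made-up labels. The function names are hypothetical.

```python
def count_matches(predicted, actual):
    """Count positions where the two columns agree, like SUMPRODUCT(--(range1=range2))."""
    return sum(p == a for p, a in zip(predicted, actual))


def precision_recall(predicted, actual, label):
    """Return (precision, recall) for a single label."""
    true_positives = sum(p == label and a == label for p, a in zip(predicted, actual))
    predicted_positives = sum(p == label for p in predicted)
    actual_positives = sum(a == label for a in actual)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


predicted = ["biotech", "media", "biotech", "media"]
actual = ["biotech", "biotech", "biotech", "media"]

matches = count_matches(predicted, actual)
accuracy = matches / len(actual)
precision, recall = precision_recall(predicted, actual, "biotech")
```

Here 3 of the 4 rows match, and for biotech every predicted row is correct while one actual biotech row is missed.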
Comments and Thoughts
2/17/17
Christy: No matter what parameters I change in the NN, I can't get the accuracy to go above about 30%. Looking at the descriptions that the classifier fails on, I realized that it pretty much guesses randomly a lot of the time when the descriptions are terrible, like "We provide services to our customers." I think we need to be training and classifying based on the longer description, which is why I started working on the FixDescriptions.py script.
2/27/17
Christy: The pickle library is vital, and we should remember to use it when we use black-box-ish libraries like the sklearn classifier.
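A minimal sketch of the save-and-reload pattern with pickle follows. The dictionary is a hypothetical stand-in for the trained model components, and an in-memory buffer is used in place of a file such as Classifier.pkl so the example does not touch the disk.

```python
import io
import pickle

# Hypothetical stand-in for trained components (labels, vocabulary, weights, ...).
components = {
    "labels": ["biotech", "media"],
    "vocabulary": {"gene": 0, "video": 1},
}

# In practice this would be open("Classifier.pkl", "wb"); a buffer keeps the sketch self-contained.
buffer = io.BytesIO()
pickle.dump(components, buffer)

# Later, or in another script, reload the components instead of retraining.
buffer.seek(0)
restored = pickle.load(buffer)
```

Only unpickle files you created yourself, since loading a pickle can run arbitrary code.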