Changes

2,404 bytes added , 13:47, 21 September 2020

{{Project|Has project output=Tool|Has sponsor=McNair Center

|Has title=Industry Classifier

|Has owner=Christy Warden,

|Has start date=Spring 2017

|Has keywords=Tool|Has project status=Subsume

}}

The objective of this project is to build a neural network that can classify a firm's industry based on the text of its business description.

The following projects are dependent on this project: {{#ask: [[Category:McNair Projects]] [[Is dependent on::{{PAGENAME}}]]}}

=Summer 2018 Work=

Test data will come from Crunchbase. The database is called crunchbase2 and is located in:

/bulk/crunchbase2

The pulled information is in:

E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\Our companies with other info.xlsx

The code to build tables to pull all info is in:

E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\BuildTestData.sql

==MLP Classifier==

The new version that I am editing is:

E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\IndustryClassifierCONDENSED-USETHIS.py

The data files are:

*Small training and testing data: 2018traindata.txt and NewTestData2018.txt

*Larger training and testing data: bigtrain2018.txt and bigtest2018.txt

This file modifies the Classifier.pkl file, which stores the components of the model. Eventually, we should be able to run this through FinalIndustryClassifier.py.

The Crunchbase data in my training data has almost 40 labels, and I could not get the accuracy rate of this model to go past 30%. However, if you assign only 3 labels, the accuracy rate goes up to 50%.

==LSTM Model==

See the old page here: [[Deep Text Classifier]].

I updated the preprocessing file to run on Python 3. I tried updating this code to run on the new data from Crunchbase. Files used are located in:

E:\McNair\Projects\Accelerators\Summer 2018\Industry Classifier update\Yang's Code

You should first run the preprocessing file and then use the classification file. I could not figure out why the accuracy of this model was only 10% with 40 labels and around 30% with 5-8 labels. The accuracy of this one should be higher than that of the MLP classifier.

=New Notes=

We're rebuilding the [[Industry Classifier]] using better technology and better inputs.

For the inputs:

*Run LoadLongDescription.sql in Z:\VentureCapitalData\SDCVCData\vcdb2

*With sdccompanybase1 table already loaded, load the commented code in that file too

*This outputs longdescriptionindu.txt

=Final Product and Use=

7) Open this file (in TextPad). It should contain your output in the format Company [tab] Classification.
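If you would rather check the output programmatically than by eye, the following is a minimal sketch of reading lines in the Company [tab] Classification format. The function name and the sample company names and labels are made up for illustration; only the tab-separated layout comes from this page.

```python
import csv
import io
from collections import Counter


def read_classifications(handle):
    """Return (company, classification) pairs from tab-separated lines.

    Hypothetical helper: assumes one company per line, with the company
    name and its classification separated by a single tab.
    """
    pairs = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


# Made-up sample standing in for the real output file.
sample = io.StringIO(
    "Acme Robotics\tHardware\n"
    "Blue Health\tHealthcare\n"
    "\n"
    "Code Forge\tSoftware\n"
)

pairs = read_classifications(sample)
counts = Counter(label for _, label in pairs)  # companies per classification
```

To run this on a real output file, pass an open file handle, for example <code>open(path, newline="", encoding="utf-8")</code>, instead of the in-memory sample.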

==Command Line Use==

A command line program exists for this tool. To use it, open the Command Prompt and change directories to:

E:\McNair\Projects\Accelerators\Industry_Classifier

To run the program, enter:

python FinalIndustryClassifier_command.py

A prompt will appear asking you to enter an F or S. F stands for File Input, and S stands for Single Use.

If you select F, a prompt will appear asking you to enter an input filename, and an output filename, separated by a space.
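As a rough illustration of the prompt flow described above, here is a sketch of how the two inputs could be interpreted. This is not the code in FinalIndustryClassifier_command.py; the function names are hypothetical, and the case-insensitive handling of F and S is an assumption.

```python
def parse_mode(choice):
    """Map the F/S answer to a mode name (hypothetical helper).

    Assumption: upper- and lower-case answers are treated the same.
    """
    choice = choice.strip().upper()
    if choice == "F":
        return "file"
    if choice == "S":
        return "single"
    raise ValueError("expected F (File Input) or S (Single Use)")


def parse_filenames(line):
    """Split 'input output' into two filenames separated by whitespace."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError("expected an input and an output filename separated by a space")
    return parts[0], parts[1]


mode = parse_mode("f")
in_name, out_name = parse_filenames("companies.txt classified.txt")
```

Under this reading, a filename that itself contains a space would be split into extra pieces, so filenames without spaces are the safer choice when answering the prompt.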

=Possible Tools=
