Changes

Industry Classifier (view source)

Revision as of 14:37, 17 February 2017

739 bytes added , 14:37, 17 February 2017

no edit summary

===FindTrainData.py===

Builds a tab-delimited text file containing 200 companies for each industry classification (e.g. 200 biotech, 200 media, etc.). The hope is that using this evenly sized set as our training data will give more accurate classifications.
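The sampling step described above could look roughly like the sketch below. It is not the actual contents of FindTrainData.py: the function names, the (name, industry, description) row layout, and the fixed random seed are all assumptions made for illustration.

```python
import csv
import random
from collections import defaultdict


def sample_per_industry(rows, per_class=200, seed=0):
    """Return up to `per_class` rows for each industry label.

    `rows` is assumed to be an iterable of (name, industry, description)
    tuples; this layout is hypothetical, not taken from the SDC export.
    """
    by_industry = defaultdict(list)
    for row in rows:
        by_industry[row[1]].append(row)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for industry in sorted(by_industry):
        group = by_industry[industry]
        # Industries with fewer than `per_class` companies keep all of them.
        k = min(per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample


def write_tsv(rows, path):
    """Write the sampled rows as a tab-delimited text file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
```

Industries with fewer than 200 companies would end up under-represented in this sketch, so the per-industry counts are worth checking before training.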

===FixDescriptions.py===

Deals with the problem that output files from SDC are poorly formatted when the description runs beyond one line. Outputs a tab-delimited text file in which each description sits entirely on a single line and can be read reliably.
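A minimal sketch of this kind of line merging follows. It is not the actual FixDescriptions.py, and it rests on one assumption about the SDC export that should be checked against a real file: a line that starts a new record contains at least one tab, while an overflow line of a description contains none. The function name and the `n_fields` parameter are hypothetical.

```python
def merge_continuation_lines(lines, n_fields=2):
    """Join description overflow lines onto the record they belong to.

    Assumption (not verified against SDC output): a new record has at
    least `n_fields - 1` tabs, and a continuation line has fewer.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.count("\t") >= n_fields - 1 or not records:
            # Looks like the start of a new record.
            records.append(line.rstrip())
        else:
            # Continuation of the previous description: append with a space.
            records[-1] = records[-1] + " " + line.strip()
    return records
```

If a description itself can contain tabs, this heuristic would split it incorrectly, so a stricter record marker (for example, a known company ID pattern) would be needed.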

===Addresses.txt===

This text file contains the investment info, name, address, city, and state of portfolio companies.

===Descriptions.txt===

See an example [https://exceljet.net/formula/count-matches-between-two-columns here].

=Comments and Thoughts=

'''2/17/17'''

Christy: No matter what parameters I change in the NN, I can't get the accuracy above roughly 30%. Looking at the descriptions that the classifier fails on, I realized that it is essentially guessing at random whenever the description is uninformative, like "We provide services to our customers." I think we need to train and classify on the longer descriptions, which is why I started working on the FixDescriptions.py script.

ChristyW


