Changes

2,808 bytes added, 16:51, 3 August 2018
7/17 - Tried to run test data through FinalIndustryClassifier.py, but it doesn't work even though the same file works in IndustryClassifierCOPY.py. The Crunchbase descriptions are longer than the old ones from VentureXpert, and I think the accuracy rate may go up if I give the model more data and use these longer descriptions. I talked with Wei and we worked through the details of sklearn and the code together. See [[Industry Classifier]] for the exact files I've been using.
7/18 - Organized code into a new file called IndustryClassifierTEST.py. I cut about half of the code in the original file, around 100 lines, that was repetitive or unused, and confirmed that my new file gives the same results as the old file.
7/19 - The results do not improve with more training data. I then tried Yang's code, which uses an LSTM. The accuracy rate is still too low, and I'm trying to learn more about LSTMs to see how to adjust the parameters.
7/20 - I have not yet figured out how to get Yang's code to run at 60% accuracy as his wiki page says. If I use fewer labels (5-7 instead of 40) with Christy's code (IndustryClassifierCONDENSED-USETHIS.py), the accuracy rate is around 50%. With 40 labels, however, the accuracy rate is only 25%. With Yang's code, fewer labels also increase the accuracy rate, but not to 60%. I also helped Minh fill out some of his test data for his Demo Day project.
---------------------------------------------------------------------------------------------
7/23 - I talked with Connor to work out a method for finding the URLs that we're missing. There is code available to crawl Google, and I have modified some code to compute a "match score" between a URL and the company name. We can take the URL with the highest score among the first 5 Google results. Connor and I agreed that if the URL cannot be found within the first 5 results, then that company probably doesn't have a URL at all.
7/24 - I ran a test file through my two URL finder scripts and determined that we could get around 50% accuracy. The URLs that I could not find were usually invalid or foreign. I then cleaned the data down to the actual company names that we need and started running the URL finder on them.
7/25 - While my URL finder was running, I helped Minh fill in Demo Day training data info.
7/26 - I helped Minh fill in Demo Day training data info, sat in on a conference call with Hira and Ed, and worked on processing the results from my URL finder.
7/27 - Cleaned results from my URL finder and added them into 'The File to Rule Them All'. I also helped Connor find addresses for accelerators, and I updated wiki pages to describe my URL finder work.
---------------------------------------------------------------------------------------------
7/30 - Updated wiki pages for Google URL Finder (http://mcnair.bakerinstitute.org/wiki/U.S._Seed_Accelerators#Finding_Company_URLs) and started running Whois Parser.
7/31 - Worked with [[Minh Le]] to build the [[Seed DB Parser]]. Helped Connor recode founders' job experience.
8/1 - Filtered and organized data from the seed-db crawl, resulting in data for 257 more companies. Of these companies, only 100 gave us new info that we didn't already have. I also helped [[Grace Tan]] fill in data for the minor code mapping.
8/2 - Modified my URL finder and reran it on the crawl results to get about 200 more URLs, which I placed in The File to Rule Them All.
8/3 - Helped Connor manually add timing info for companies for which we could not find timing data from other sources.
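Note on the 7/23 entry: the "match score" idea can be sketched roughly as below. This is not the actual URL finder script, only a minimal illustration of the approach. It assumes the first Google results are already available as a list of URLs, and it uses the standard library's difflib string similarity as a stand-in for the real scoring. The function names, the `threshold` value of 0.6, and the rule of comparing only the first label of the host name are all hypothetical choices made for the example.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize(text):
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def domain_label(url):
    """Return the first label of the host name, ignoring a leading 'www.'.

    Simplification for this sketch: 'blog.example.com' yields 'blog'.
    """
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).netloc.lower().split(":")[0]
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def match_score(url, company_name):
    """Similarity in [0, 1] between the URL's domain label and the company name."""
    return SequenceMatcher(
        None, normalize(domain_label(url)), normalize(company_name)
    ).ratio()


def best_url(company_name, candidate_urls, top_n=5, threshold=0.6):
    """Pick the highest-scoring URL among the first top_n candidates.

    Returns None when there are no candidates or the best score is below
    the (hypothetical) threshold, i.e. the company is treated as having no URL.
    """
    scored = [(match_score(u, company_name), u) for u in candidate_urls[:top_n]]
    if not scored:
        return None
    score, url = max(scored, key=lambda pair: pair[0])
    return url if score >= threshold else None


if __name__ == "__main__":
    results = [
        "https://en.wikipedia.org/wiki/Acme",
        "https://www.acmerobotics.com/about",
        "https://www.linkedin.com/company/acme",
    ]
    print(best_url("Acme Robotics", results))
```

Only the first `top_n` results are scored, matching the decision that a company whose URL is not in the first 5 results probably has no URL. The cutoff would need tuning against a hand-checked test file like the one from 7/24.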