Maxine Tao (Work Log)

Summer 2018

6/21 -- Downloaded Crunchbase data using API version 3.1, loaded 17 files into the crunchbase2 database, checked each table to make sure the specs matched the new data, and updated line counts. Grace and I ran into an issue with blank strings on date-typed columns: dates given as "" were not being read as null. We fixed this with a one-line command that is written up on Crunchbase Data. Later we took Connor's master list of 166 accelerators and tried to create a table of accelerators and their UUIDs using the 'organizations' table. Some names matched multiple times and some did not match at all, so we ended up with 179 matches, which we will clean through tomorrow.
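
For reference, the date fix is conceptually a pre-load cleanup pass; this is a minimal sketch assuming tab-delimited export files, Postgres's \N null marker, and hypothetical column positions (the actual one-liner is on Crunchbase Data):

 # Sketch: rewrite blank date fields as Postgres's \N null marker before COPY.
 # Assumes tab-delimited exports; the date column positions are hypothetical.
 import csv
 
 DATE_COLS = {5, 6}  # hypothetical positions of the date-typed columns
 
 with open('organizations.txt', newline='') as fin, \
         open('organizations_clean.txt', 'w', newline='') as fout:
     reader = csv.reader(fin, delimiter='\t')
     writer = csv.writer(fout, delimiter='\t')
     for row in reader:
         writer.writerow([r'\N' if i in DATE_COLS and val == '' else val
                          for i, val in enumerate(row)])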

6/22 -- Loaded the Accelerator Master List as a table and matched on accelerator name or accelerator URL. Manually removed bad results that had the same name but different URLs, or the same URL but different names. There were 34 entries from the master accelerator list that could not be matched to anything in the crunchbase table 'organizations'. Grace and I manually searched for these using ILIKE and found a number of matches that we added back into our spreadsheet. We now have a clean list of accelerator names, their matches from the crunchbase data, and their UUIDs.
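
The ILIKE searches look roughly like the sketch below, assuming a local psycopg2 connection; the column names (company_name, uuid) are assumptions about how the load was specced:

 # Sketch of the manual ILIKE searches for the 34 unmatched master-list
 # entries; 'organizations' follows the log, the column names are assumed.
 import psycopg2
 
 unmatched = ['Example Accelerator']  # stand-in for the 34 unmatched names
 
 conn = psycopg2.connect(dbname='crunchbase2')
 cur = conn.cursor()
 for name in unmatched:
     cur.execute(
         "SELECT company_name, uuid FROM organizations "
         "WHERE company_name ILIKE %s",
         ('%' + name + '%',)
     )
     for row in cur.fetchall():
         print(name, '->', row)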


6/25 -- Updated the list of accelerators and their UUIDs with Connor and Grace (we now have 163 matches) and created a table in the crunchbase2 database called 'AccUUIDsFinal'. It has 3 columns: accelerator names from the master list, accelerator names from crunchbase, and accelerator UUIDs from crunchbase. We then joined this table back to the fields we need from crunchbase; the new table is called 'AccAllInfo'. From this table, joining accelerator UUIDs to company UUIDs does not do what we want: it returns investors that have invested in the accelerators themselves. From this, Connor and I figured out that company_name/company_uuid actually refers to the company being invested in. Joining accelerator names to investor names also returns nothing, yet when I manually searched Y Combinator as an investor name, I got results back. Not sure what is going on - I think the accelerator-names-to-investor-names join should work.

6/26 -- Fixed yesterday's issue of no matches. The problem was that the investor_names field was surrounded by curly braces. I removed these and saved a clean version as 'funding_rounds-no brackets.txt'. I found that matching accelerator UUIDs to investor UUIDs gives more matches than matching accelerator names to investor names: there are 631 matches, most of which are labeled as seed-type investments.
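
The cleanup itself is simple; a sketch, assuming a flat-file export where investor fields come out of Postgres looking like {Y Combinator,SV Angel}:

 # Strip the curly braces that Postgres array exports wrap around
 # investor_names/investor_uuids, then reload and join on UUIDs.
 with open('funding_rounds.txt') as fin, \
         open('funding_rounds-no brackets.txt', 'w') as fout:
     for line in fin:
         fout.write(line.replace('{', '').replace('}', ''))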

6/27 -- Filled in a spreadsheet of the unique accelerators from yesterday's matches with flags indicating whether or not they take equity, plus notes about specifics. This is incomplete; there are some that I'm not sure about or couldn't find information for. Also helped Connor manually filter out duplicated company names, and helped Grace with the LinkedIn crawler; it works for founders that we have URLs for, but it crashes otherwise.

6/28 -- Worked with Minh and Grace to debug the LinkedIn crawler; we had an issue with the XPath of the LinkedIn search box. Also helped Connor fill in accelerator terms on the master variable list. I filtered the list of accelerators and the companies they've invested in by investment amount. Where the amounts match what is given on the accelerator's website, I put them into a separate sheet under 'Accelerators and Investments'.
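
The crawler-side fix amounts to locating the search box by a fresher XPath and waiting for it to load; a sketch assuming Selenium, with a hypothetical XPath since LinkedIn's markup (and therefore the real expression) keeps changing:

 # Wait for the search box instead of assuming it's already in the DOM.
 # The XPath below is a hypothetical stand-in for LinkedIn's real markup.
 from selenium import webdriver
 from selenium.webdriver.common.by import By
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.support.ui import WebDriverWait
 
 driver = webdriver.Chrome()
 driver.get('https://www.linkedin.com')
 searchbox = WebDriverWait(driver, 10).until(
     EC.presence_of_element_located(
         (By.XPATH, "//input[contains(@class, 'search')]")
     )
 )
 searchbox.send_keys('founder name')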

6/29 -- I filtered out accelerator-investment matches whose data agreed with the terms of joining given on the accelerator website. Then I matched this data against the cohort list for companies without cohort years. I was only able to find 5 companies, which means this approach will not get us the data we want. After calling Ed, I matched a list of company names (from our data) to itself and a list of company names (from crunchbase) to itself. These two files have not been cleaned, but they are in McNair\Software\Database Scripts\Crunchbase2 with -MATCHED at the ends of their file names.


7/9 - Tried to make sense of the matcher's output so I could understand last week's results. After talking to Dylan and Connor, we decided to go through all of the matches from our data that were flagged as multiple matches. In the first sheet of a file called 'company name self matches', orange highlights mark a minor normalization difference, red highlights mark likely duplicates, and yellow highlights mark possible duplicates I wasn't sure about. I also entered XXXX wherever there was a blank data field, to prevent the matcher from shifting data into the wrong columns.
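
The padding step is mechanical; a sketch assuming the matcher reads tab-delimited input (the file names here are hypothetical):

 # Fill blank fields with XXXX so every row has a value in every column
 # and nothing shifts when the matcher reads it.
 with open('company name self matches.txt') as fin, \
         open('company name self matches-padded.txt', 'w') as fout:
     for line in fin:
         fields = line.rstrip('\n').split('\t')
         fout.write('\t'.join(f if f.strip() else 'XXXX' for f in fields) + '\n')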

7/10 - Tried to join crunchbase companies with the companies in the 'cohort list new' tab of the master list; most of them do match into crunchbase. I also learned that company names pulled from crunchbase can have formatting issues. I talked with Connor and Dylan about building a master spreadsheet of all our company data. I created a table of companies, their UUIDs, and a count of the number of times each company appears in crunchbase, then joined it with a rough list of our companies that Connor gave me. There were about 350 companies where we needed to manually check whether they exist in the "Duplicate Companies" spreadsheet; Connor and I completed this task together.
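
The appearance-count table is essentially one query; a sketch assuming psycopg2 and that names live in organizations.company_name (the table and column names are assumptions):

 # Build a table of (company_name, uuid, n_appearances) using a window
 # function so each row keeps its UUID alongside the duplicate count.
 import psycopg2
 
 conn = psycopg2.connect(dbname='crunchbase2')
 cur = conn.cursor()
 cur.execute("""
     CREATE TABLE companycounts AS
     SELECT company_name, uuid,
            COUNT(*) OVER (PARTITION BY company_name) AS n_appearances
     FROM organizations;
 """)
 conn.commit()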

7/11 - Matched crunchbase companies to our companies and imported all UUIDs into a final list. I talked with Dylan and Connor to determine how to filter the results appropriately and only keep those that are the most accurate. All files and file locations are described in Crunchbase Data#Collecting Company Information.

7/12 - Looked at input data for the Industry Classifier. When I tried to build test data from the final list we compiled yesterday, I discovered some duplicate issues. Connor and I figured out which true duplicates to remove and made a final spreadsheet called 'The File to Rule Them All'.

7/13 - Pulled descriptions and industry tags from crunchbase to match with the UUIDs we already have. The resulting tables are in the Industry Classifier update folder of Accelerators\Summer 2018. This morning I read Christy and Yang's wiki project pages. To start, I tried to figure out how best to build a new coding system for the industry flags given in crunchbase. Looking at the old code in FinalIndustryClassifier_command.py, I can't find the file that builds the neural net; there are many versions floating around and I'm not sure which is the correct one.


7/16 - Figured out which file is capable of rewriting the Classifier.pkl file and how all the code and test files fit together. I built a small training and test data set to work with and got IndustryClassifierCOPY.py to run on my data. I had to fix many index and key issues in parts of the code, which is not commented at all. With 10 industry categories and 970 training data points, I think the accuracy rate is around 30%. I tried to run the code on a bigger training data set, hoping that the accuracy rate would come up, but got error messages back.
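
For context, the pipeline boils down to vectorize text, fit a classifier, pickle it. This is a minimal sketch with assumed choices (TF-IDF features plus naive Bayes), since the real feature and model code is buried in IndustryClassifierCOPY.py:

 # Minimal sketch of the train-and-pickle pipeline; TF-IDF and MultinomialNB
 # are assumed stand-ins for whatever the original code actually uses.
 import pickle
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.naive_bayes import MultinomialNB
 from sklearn.pipeline import make_pipeline
 
 texts = ['cloud hosting for developers', 'mobile payments app']  # descriptions
 labels = ['Software', 'Finance']                                 # industry tags
 
 model = make_pipeline(TfidfVectorizer(), MultinomialNB())
 model.fit(texts, labels)
 
 with open('Classifier.pkl', 'wb') as f:  # the file that gets rewritten
     pickle.dump(model, f)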

7/17 - Tried to run test data through FinalIndustryClassifier.py, but it doesn't work even though the same file runs in IndustryClassifierCOPY.py. The crunchbase descriptions are longer than the old ones from VentureXpert, so the accuracy rate may come up if I give the model more data and use these longer descriptions. I talked with Wei and we worked through the details of sklearn and the code together. See Industry Classifier for the exact files I've been using.

7/18 - Organized the code into a new file called IndustryClassifierTEST.py. I cut about half of the code, around 100 lines, that was repetitive or unused from the original file, and confirmed that my new file gives the same results as the old one.

7/19 - The results do not improve with more training data. I then tried Yang's LSTM code. The accuracy rate is still too low, and I'm trying to learn more about LSTMs to see how to adjust the parameters.
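
A rough sketch of the LSTM shape I'm tuning, assuming Keras; the vocabulary size, sequence length, and layer widths are exactly the parameters in question, and the values here are placeholders:

 # Placeholder LSTM text classifier: embed tokens, run one LSTM layer,
 # project to a softmax over the industry labels.
 from keras.models import Sequential
 from keras.layers import Embedding, LSTM, Dense
 
 VOCAB_SIZE, SEQ_LEN, N_CLASSES = 10000, 200, 40  # assumed dimensions
 
 model = Sequential([
     Embedding(VOCAB_SIZE, 128, input_length=SEQ_LEN),
     LSTM(64),
     Dense(N_CLASSES, activation='softmax'),
 ])
 model.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])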

7/20 - I have not yet figured out how to get Yang's code to run at the 60% accuracy his wiki page reports. If I use fewer labels (5-7 instead of 40) with Christy's code (IndustryClassifierCONDENSED-USETHIS.py), the accuracy rate is around 50%; with all 40 labels it is only 25%. With Yang's code, fewer labels also increases the accuracy rate, but not to 60%. I also helped Minh fill out some of his test data for his Demo Day project.
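
Conceptually, the label condensing is just a many-to-one mapping applied before training; a hypothetical sketch (the real grouping is Christy's, and these label names are made up for illustration):

 # Hypothetical condensing of 40 fine-grained industry labels into buckets.
 CONDENSE = {
     'Mobile': 'Software', 'SaaS': 'Software',
     'FinTech': 'Finance', 'InsurTech': 'Finance',
     'BioTech': 'Health', 'MedTech': 'Health',
 }
 labels = ['Mobile', 'BioTech', 'Robotics']
 condensed = [CONDENSE.get(label, 'Other') for label in labels]
 # -> ['Software', 'Health', 'Other']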


7/23 - I talked with Connor to work out a method for finding the URLs that we're missing. There is code available to crawl Google, and I have modified some of it to compute a "match score" between a URL and the company name. We can take the URL with the highest score among the first 5 Google results. Connor and I decided that if a good URL cannot be found within the first 5 results, that company probably doesn't have a URL at all.
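
The scoring idea, sketched with difflib standing in for whatever similarity measure the modified code actually uses (the 0.6 cutoff is a placeholder):

 # Score each candidate URL by string similarity between the company name
 # and the result's domain; return None when no result clears the cutoff.
 import re
 from difflib import SequenceMatcher
 from urllib.parse import urlparse
 
 def match_score(company_name, url):
     domain = urlparse(url).netloc.replace('www.', '').split('.')[0]
     name = re.sub(r'[^a-z0-9]', '', company_name.lower())
     return SequenceMatcher(None, name, domain).ratio()
 
 def best_url(company_name, top5_urls, threshold=0.6):
     score, url = max((match_score(company_name, u), u) for u in top5_urls)
     return url if score >= threshold else None  # None: probably no URL exists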

7/24 - I ran a test file through my two URL-finder scripts and determined that we can get around 50% accuracy. The URLs I could not find were usually invalid or foreign. I then cleaned the data down to the actual company names we need and started running that.