{{Project|Has project output=Data|Has sponsor=McNair ProjectsCenter
|Has title=Merging Existing Data with Crunchbase
|Has owner=Connor Rothschild
Building training data: 2,600 pages in total, and we are a little more than halfway through (~1,500-1,600 done).
 
==Finding Company URLs==
In this file (sheet: 'Most Recent Merged Data'):
E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx
 
We filter for companies (~4500) that did not receive VC, are not in Crunchbase, and do not have URLs.
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible. These are in:
E:\McNair\Projects\Accelerators\Summer 2018\url finder
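
The matching logic in STEP2_findcorrecturl.py is not described on this page, so the following is only a minimal sketch of one way a URL matcher could work, assuming that a correct URL is one whose hostname contains the company name as a label. The names `normalize_name` and `pick_company_url` and the suffix list are hypothetical and are not taken from the actual script.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "co", "corp", "ltd"}

def normalize_name(company_name):
    """Lowercase the name, drop punctuation and legal suffixes, join the rest."""
    tokens = re.findall(r"[a-z0-9]+", company_name.lower())
    return "".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def pick_company_url(company_name, candidate_urls):
    """Return the first candidate whose hostname has a label equal to the name.

    Assumption: exact label equality is a conservative stand-in for
    whatever comparison the real script performs.
    """
    target = normalize_name(company_name)
    if not target:
        return None
    for url in candidate_urls:
        host = (urlparse(url).hostname or "").lower()
        # Compare against every label except the last one (the TLD).
        labels = [re.sub(r"[^a-z0-9]", "", part) for part in host.split(".")[:-1]]
        if target in labels:
            return url
    return None
```

For example, `pick_company_url("Acme Labs, Inc.", urls)` would accept `https://www.acmelabs.com/about` but skip a news article hosted on another domain that merely mentions the company in its path.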
 
To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as the search term and taking the first 4 valid results (see the "don't collect" list in the code). The Google crawler and URL matcher correctly identified around 20 URLs. I then reran the roughly 20 unfound company names through the crawler, this time using company name + "startup" as the search term. This did not identify any additional correct URLs.
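
The "first 4 valid results" step above can be sketched as follows. This is an illustration only: the domains in `DONT_COLLECT` are placeholder assumptions (the real "don't collect" list is in the code), and `first_valid_results` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Placeholder skip list; the actual "don't collect" list lives in the crawler code.
DONT_COLLECT = {"linkedin.com", "facebook.com", "crunchbase.com"}

def first_valid_results(urls, limit=4, skip_domains=DONT_COLLECT):
    """Keep the first `limit` search results whose host is not on the skip list."""
    valid = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # Skip the listed domain itself and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in skip_domains):
            continue
        valid.append(url)
        if len(valid) == limit:
            break
    return valid
```

Results are kept in search-rank order, so a skipped result does not count toward the limit of 4.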
 
It seems reasonable to assume that if a company's URL cannot be found within the first 4 valid search results, the company probably does not have a URL at all. This is the case for many of the 20 companies whose URLs were not found in the test run above.
