We filter for the companies (~4,500) that did not receive venture capital (VC), are not in Crunchbase, and do not have URLs.
Using a Google crawler (STEP1_crawl.py) and a URL-matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible. Both scripts are in: E:\McNair\Projects\Accelerators\Summer 2018\url finder

To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as the search term and taking the first 4 valid results (see the don't-collect list in the code). The Google crawler and URL matcher together correctly identified around 20 URLs. I then ran the roughly 20 unfound company names through the crawler again, this time using the company name plus "startup" as the search term. This did not identify any more correct URLs.

It seems reasonable to assume that if a company's URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the roughly 20 companies left unfound in the test run above.
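The "first 4 valid results" rule and the matching step can be sketched roughly as follows. This is only an illustration and not the code in STEP1_crawl.py or STEP2_findcorrecturl.py: the function names, the contents of DONT_COLLECT, and the substring-based matching heuristic are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical don't-collect list; the real one lives in the project's code.
DONT_COLLECT = {"linkedin.com", "facebook.com", "crunchbase.com"}


def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def first_valid_results(urls, dont_collect=DONT_COLLECT, limit=4):
    """Keep the first `limit` search results whose domain is not excluded."""
    valid = []
    for url in urls:
        host = domain_of(url)
        # Skip excluded domains and any of their subdomains.
        if any(host == d or host.endswith("." + d) for d in dont_collect):
            continue
        valid.append(url)
        if len(valid) == limit:
            break
    return valid


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def match_url(company_name, urls):
    """Return the first URL whose host contains the normalized company name.

    This is a crude assumed heuristic; it returns None when nothing matches.
    """
    key = normalize(company_name)
    if not key:
        return None
    for url in urls:
        if key in normalize(domain_of(url)):
            return url
    return None
```

For example, given the search results for one company as a list of URLs, match_url(name, first_valid_results(results)) returns either a candidate URL or None. Under the assumption discussed above, None would be read as "this company probably has no URL".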