Changes

Merging Existing Data with Crunchbase

Revision as of 10:22, 25 July 2018

147 bytes added , 10:22, 25 July 2018

E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx

We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.
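The filter described above could be sketched roughly as follows. This is only an illustration: the field names (received_vc, in_crunchbase, url) and the sample rows are hypothetical, and the real spreadsheet's column headers may differ.

```python
# Hypothetical field names and sample rows; not taken from the actual spreadsheet.
def needs_url_search(row):
    """True if the company got no VC, is not in Crunchbase, and has no URL."""
    return (
        not row["received_vc"]
        and not row["in_crunchbase"]
        and not (row.get("url") or "").strip()
    )

companies = [
    {"name": "Acme Robotics", "received_vc": False, "in_crunchbase": False, "url": ""},
    {"name": "Beta Labs", "received_vc": True, "in_crunchbase": False, "url": ""},
    {"name": "Gamma Health", "received_vc": False, "in_crunchbase": False,
     "url": "http://gammahealth.example"},
    {"name": "Delta Foods", "received_vc": False, "in_crunchbase": True, "url": None},
]

# Only companies that pass all three conditions go on to the URL finder.
to_search = [c["name"] for c in companies if needs_url_search(c)]
print(to_search)
```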

Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible. These are in:

E:\McNair\Projects\Accelerators\Summer 2018\url finder

To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as the search term and taking the first 4 valid results (see the "don't collect" list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentified some URLs that look very similar to the company name, but the results are accurate for the most part as long as the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + "startup" as the search term. This did not identify any more correct URLs.
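As a rough sketch of the idea (not the actual contents of STEP1_crawl.py or STEP2_findcorrecturl.py), the snippet below keeps the first 4 search results whose domain is not on a "don't collect" list and then picks the first one whose domain label resembles the company name. The DONT_COLLECT entries, function names, and sample URLs are all assumptions made for illustration, and the search results are passed in as a plain list rather than fetched from Google.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-in for the "don't collect" list in the real code.
DONT_COLLECT = {"linkedin.com", "facebook.com", "twitter.com", "crunchbase.com"}
MAX_VALID_RESULTS = 4

def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def first_valid_results(urls, limit=MAX_VALID_RESULTS):
    """Keep the first `limit` URLs whose domain is not on the don't-collect list."""
    valid = []
    for url in urls:
        d = domain_of(url)
        if any(d == b or d.endswith("." + b) for b in DONT_COLLECT):
            continue
        valid.append(url)
        if len(valid) == limit:
            break
    return valid

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def match_url(company, urls):
    """Return the first valid URL whose leading domain label resembles the name."""
    key = normalize(company)
    for url in first_valid_results(urls):
        label = normalize(domain_of(url).split(".")[0])
        # Require a few characters so very short labels do not match by accident.
        if len(label) >= 3 and (key in label or label in key):
            return url
    return None

results = [
    "https://www.linkedin.com/company/acme-robotics",
    "https://www.acmerobotics.com/about",
    "https://news.example.org/acme",
]
found = match_url("Acme Robotics", results)
missing = match_url("Zeta Widgets", results)
```

A substring comparison like this is deliberately loose, which is consistent with the behavior noted above: domains that merely look similar to a generic company name can be matched incorrectly.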

It seems reasonable to assume that if a company's URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 companies whose URLs were not found in my test run above.

Maxine.tao

