|Does subsume=Accelerator Data, Accelerator Seed List (Data),
}}
<onlyinclude>The [[U.S. Seed Accelerators]] project subsumes several related projects. These projects were intended to assemble near-population data on high-growth high-tech seed accelerators in the U.S. and understand how to automate the data collection process. As such, the project includes both a dataset and prototypes. Some of the prototypes were used in the [[Kauffman Incubator Project]].</onlyinclude>
==Project Location==
The master file can be found at
/bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''
Note that TFTRTA-AcceleratorFinal.txt in E:\projects\accelerators was updated to include all creation dates and dead dates.
==Relevant Former Projects==
==Update for Hira==
===Final MTurk Push===
Minh and I pushed a final batch of HITs to MTurk. We found that, even after MTurk, our data was missing timing info for around 1000 companies. Upon further inspection, we realized that around 800 of these companies belonged to only ~10 accelerators. We think the problem was that Google returns the most recent results first, so we missed the old cohorts of large accelerators. We therefore re-ran Minh's crawler on these accelerators with different year parameters. We got 650 results.
Upon pushing these to MTurk, we got good results for 144 of them. This number was the product of filtering out results with no companies listed, no date listed, and no accelerator listed (after searching manually). We also removed duplicates and accelerators we do not care about. The 144 results collectively list 1,538 companies.
This file can be found here:
/bulk/McNair/Projects/Accelerators/Summer 2018/Final Turk Push.xlsx
The next step is to plug this sheet into Grace's Python script which takes these companies and converts each company to its own row, so that it can be merged with our other data.
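The row-per-company conversion can be sketched as follows. This is only a minimal illustration, not Grace's script: the column names (<code>accelerator</code>, <code>cohort_date</code>, <code>companies</code>), the <code>;</code> delimiter, and the sample rows are all assumptions made for the example.

```python
import csv
import io

def expand_companies(rows, company_field="companies", delimiter=";"):
    """Yield one output row per company named in each input row.

    company_field and delimiter are assumptions about the sheet layout.
    """
    for row in rows:
        names = [n.strip() for n in row[company_field].split(delimiter)]
        for name in names:
            if not name:
                continue  # skip empty entries such as a trailing delimiter
            # Copy every other column unchanged and add a single-company column.
            out = {k: v for k, v in row.items() if k != company_field}
            out["company"] = name
            yield out

# Hypothetical sample data standing in for an export of the MTurk sheet.
sample = io.StringIO(
    "accelerator,cohort_date,companies\n"
    "Example Accelerator,2015,Alpha Co; Beta Co; Gamma Co\n"
    "Another Accelerator,2016,Delta Co\n"
)

expanded = list(expand_companies(csv.DictReader(sample)))
for r in expanded:
    print(r["accelerator"], r["cohort_date"], r["company"], sep="\t")
```

Keeping the accelerator and date columns on every expanded row is what lets the result be merged with the other company-level data.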
===Manual Searching===
For the other 170 companies that lacked timing info (and were not worth crawling for, because few companies were assigned to each accelerator), McNair Center interns manually searched for timing info. Of these 170 companies, we found timing information for 128.
Excel master datasets are in:
E:\McNair\Projects\Accelerators\Summer 2018

Code and files specific to this URL finder are in:
E:\McNair\Projects\Accelerators\Summer 2018\url finder
====Results====
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.
====Testing====
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):
E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx

We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.

Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.

To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as a search term and taking the first 4 valid results (see the don't collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look really similar to the company name, but they are accurate for the most part if the name is not too generic. I then tried to run the 20 unfound company names through the crawler again, but this time I used company name + startup as the search term. This did not identify any more correct URLs.

It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all.
This is the case for many of the 20 unfound URLs from my test run above.
====Actual Run Info====
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.

The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.

The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.

Note that in the end, I decided to only take URLs that were given a match score greater than 0.9, by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates and inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company.

The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser.
====Using Python files====
'''To use STEP1_crawl.py''':

INPUT: a list of company names (or anything) you would like to find websites for by searching on Google

OUTPUT: a list of company names and the top X number of results from Google

1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search.

2. Change NUMRESULT to be however many results you would like from Google.

3. Adjust DONT_COLLECT to include any websites that you don't want.

4. If you would like to add another search keyword, add this in line 87, which is queries.append(name + "whatever you want here")

5. Change line 127 to be the name of your output file.

'''To use STEP2_findcorrecturl.py''':

INPUT: output file from STEP1

OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with "no match"

1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part.

2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75, and my use of 0.9 ensures almost exact matches. However, you should consider that if your list of companies (or whatever you are searching) includes really common names, then a 90%+ match might not be exactly what you're looking for.

'''To use STEP3_clean.py''':

Note that this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2.
1. Change file f to be the output file from STEP2 (you should delete anything that says "no match", and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part.

Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than one URL with the highest score, the script should take the first one.
===Seed DB Parser===
See [[Seed DB Parser]] for information on functionality.

The results from crawling Seed DB gave us more information for 257 companies. This is located in (sheet: final):
E:\McNair\Projects\Seed DB\merging work.xlsx
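
Returning to the URL finder: the thresholding in STEP2 and the optional "take the highest score, first one on ties" rule in STEP3 can be sketched together as below. The actual scripts are not reproduced on this page, so this is only an illustration under stated assumptions: it assumes the match score is a <code>difflib.SequenceMatcher</code> ratio between the cleaned company name and the first label of the URL's domain, and the helper names (<code>domain_label</code>, <code>match_score</code>, <code>best_url</code>) and sample URLs are hypothetical.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def domain_label(url):
    """Return the first label of the host, e.g. 'acmerobotics' for www.acmerobotics.com."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]

def match_score(company, url):
    """Similarity in [0, 1] between a company name and a URL's domain label.

    Assumption: the score is a SequenceMatcher ratio; the real STEP2 script may differ.
    """
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, domain_label(url)).ratio()

def best_url(company, urls, threshold=0.9):
    """Return (url, score) for the highest-scoring URL above the threshold.

    Ties keep the first URL seen; if nothing passes, return ("no match", 0.0).
    """
    best, best_score = "no match", 0.0
    for url in urls:
        score = match_score(company, url)
        if score > threshold and score > best_score:
            best, best_score = url, score
    return best, best_score

# Hypothetical crawl results for one company.
candidates = [
    "https://en.wikipedia.org/wiki/Acme",
    "https://www.acmerobotics.com/about",
    "https://acmerobotics.com",
]
print(best_url("Acme Robotics", candidates))
print(best_url("Acme Robotics", ["https://example.org"]))
```

Lowering <code>threshold</code> (for example to 0.75) admits looser matches, which is when picking the single best URL per company becomes useful.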