2,584 bytes added ,  12:31, 6 October 2020
{{Project|Has project output=Data,Tool|Has sponsor=McNair Center
|Has title=U.S. Seed Accelerators
|Has owner=Connor Rothschild,
|Does subsume=Accelerator Data, Accelerator Seed List (Data),
}}
<onlyinclude>The [[U.S. Seed Accelerators]] project subsumes several related projects. These projects were intended to assemble near-population data on high-growth high-tech seed accelerators in the U.S. and understand how to automate the data collection process. As such, the project includes both a dataset and prototypes. Some of the prototypes were used in the [[Kauffman Incubator Project]].</onlyinclude>
==Project Location==
The master file can be found at
/bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''
 
Note that TFTRTA-AcceleratorFinal.txt in E:\projects\accelerators was updated to include all creation dates and dead dates.
==Relevant Former Projects==
==Update for Hira==
===Final MTurk Push===
Minh and I pushed a final batch of HITs to MTurk. We found that, even after MTurk, our data was missing timing info for around 1000 companies. Upon further inspection, we realized that around 800 of these companies belonged to only ~10 accelerators. We think the problem was that Google returns the most recent results first, so we missed old cohorts for large accelerators. We therefore re-ran Minh's crawler on these accelerators with different year parameters and got 650 results.

Upon pushing these to MTurk, we got good results for 144 of them. This number is the product of filtering out entries with no companies listed, no date listed, and no accelerator listed (after searching manually). We removed duplicates and removed accelerators we do not care about. These 144 results collectively list 1,538 companies.

This file can be found here:
/bulk/McNair/Projects/Accelerators/Summer 2018/Final Turk Push.xlsx

The next step is to plug this sheet into Grace's Python script, which converts each company to its own row so that it can be merged with our other data.

===Manual Searching===
For the other 170 companies we lacked timing info for (which were not worth crawling for because few companies were assigned to each accelerator), McNair Center interns manually searched for timing info. Of the 170 companies we searched for, we found timing information for 128.

The sheet can be found here: https://docs.google.com/spreadsheets/d/1hGgxNwLph0tWtqO_8bNUGM-kzVXTeb-N26ojwL3TTuk/edit?usp=sharing and is ready to merge in with our existing data.

===Recoded Founders' Experience===
I have updated and reclassified founders' job titles. We began with 451 unique job titles and were able to condense them into 16 categories, which are:
*Academic
*Advisor
*Board
*C-level
*CEO
*Director
*Founder
*Investment
*Management
*Marketing
*Partner
*President
*VP
*Other

The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet: https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing

This has been merged into the File to Rule Them All.

===Recoded Stage===
I have updated and cleaned up the "what stage accs look for companies in" variable by splitting it into three categories:
*seed
*early stage venture
*late

Other classifications were collapsed into these three or were not frequent enough (n<2) to be coded as their own classification.

The Google Sheet used to recode the stage variable can be found here: https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing

It has been merged into The File To Rule Them All.

===Recoded Dead Accelerators===
We have updated dead accelerators on the following Google Sheet: https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing

This has been merged into The File To Rule Them All.

===Recoded Equity/Investment===
The Google Sheet with this work is here: https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing

It has also been merged into The File To Rule Them All. I have updated equity using data from https://www.seed-db.com/accelerators.
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:
*the midpoint normalization, which takes the midpoint of each accelerator's investment range.
*the upper bound normalization, which takes the highest amount each accelerator will invest.

This dual normalization was performed because many accelerators say they invest "Up to $__,___", so a midpoint may not accurately reflect actual investment amounts.

The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.

'''NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range and the upper limit of $500,000 may throw off our analysis. By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''

===Recoding Founders===
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).

We will need to talk about how to categorize the most extraneous job titles.

===Amazon Mechanical Turk Pricing===
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]

===Recoded Founders Education===
I have recoded two components of the founders' education sheet:

1) Degree name has been reclassified into nine categories:
*High School
*Associates
*Bachelors
*Masters
*Certificate
*JD
*MBA
*PhD
*Other

2) Majors have also been recoded into nine categories:
*H = Humanities
*SS = Social Sciences
*NS = Natural Sciences
*E = Engineering (includes computer science)
*B = Business and Economics
*L = Leadership
*MBA
*JD
*O = Other

The Google Sheet I used to reclassify can be found here: https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing

The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process/formulas of reclassifying, and '''Updated Info''' contains only good, updated data.

The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (just wanted the go-ahead from Hira or Ed).

===Recoded multiple campuses and cohorts===
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.
===Fixed Manual Data from Google Sheet===
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called "Good Data Only", at the same link:
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.
===Recoded employee count===
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.
The employee count column is standardized and can easily be edited given some modification of the Excel formula.
 
===Normalized investment amount===
 
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators "take up to __%" equity and "invest up to $___" that I think it's smartest that we think hard about how to standardize first. Also notable is that some accelerators say they provide "$____ up front and another $___ in follow up funding for each stage." How do we deal with these? Message me if you'd like to talk more about this.
 
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.
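As a starting point for that discussion, the following is a minimal Python sketch of the two normalizations mentioned on this page (midpoint of a range and upper bound). It is not the Excel formula used in The File to Rule Them All. The function names are hypothetical, and the treatment of "up to $X" as a range from $0 to $X is an assumption that we would need to agree on.

```python
import re
from statistics import mean


def parse_investment(text):
    """Return (midpoint, upper_bound) in dollars for a free-text description.

    Handles forms such as "$20,000", "$25,000 - $50,000",
    "between $50,000 and $500,000" and "Up to $100,000".
    Returns None when no dollar amount is found.
    """
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$\s*(\d[\d,]*)", text)]
    if not amounts:
        return None
    lower, upper = min(amounts), max(amounts)
    if len(amounts) == 1 and re.search(r"up to", text, re.IGNORECASE):
        lower = 0  # ASSUMPTION: "up to $X" is treated as the range $0 to $X
    return ((lower + upper) / 2, upper)


def average_investments(descriptions):
    """Average the midpoint and upper-bound values over all parseable descriptions."""
    parsed = [p for p in map(parse_investment, descriptions) if p is not None]
    return mean(m for m, _ in parsed), mean(u for _, u in parsed)


# Made-up example inputs, not values from our data.
examples = ["$20,000", "$25,000 - $50,000", "Up to $100,000", "between $50,000 and $500,000"]
print(average_investments(examples))
print(average_investments(examples[:-1]))  # same averages with the widest range left out
```

Running the averages with and without a single wide range, as in the last two lines, is one way to see how much an outlier moves each normalization.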
 
===Remaining to do===
 
*Founders Experience: code job title
*Founders Education: remove unknowns, code degree and code major
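
Coding job titles could be partly automated. The following is a rough Python sketch, assuming an ordered list of keyword rules where the first match wins. The category names are the ones listed on this page; the keyword patterns and the function name are illustrative assumptions, not the formulas used in the Google Sheet.

```python
import re

# Ordered (category, pattern) rules; the first matching rule wins.
# ASSUMPTION: these patterns are illustrative only and would need review.
RULES = [
    ("CEO", r"\bceo\b|chief executive"),
    ("C-level", r"\bc[a-z]o\b|\bchief\b"),
    ("Founder", r"founder"),
    ("VP", r"\bvp\b|vice president"),
    ("President", r"\bpresident\b"),
    ("Board", r"\bboard\b"),
    ("Advisor", r"advis[eo]r"),
    ("Partner", r"\bpartner\b"),
    ("Director", r"\bdirector\b"),
    ("Academic", r"professor|lecturer|researcher"),
    ("Investment", r"invest|venture"),
    ("Marketing", r"marketing"),
    ("Management", r"manager|management"),
]


def code_job_title(title):
    """Map a raw job title to a category, falling back to "Other"."""
    normalized = title.strip().lower()
    for category, pattern in RULES:
        if re.search(pattern, normalized):
            return category
    return "Other"


for raw in ["Co-Founder & CEO", "CTO", "Vice President, Sales", "Intern"]:
    print(raw, "->", code_job_title(raw))
```

Because the rules are ordered, a title such as "Co-Founder & CEO" is coded as CEO rather than Founder; reordering the list changes that priority.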
==Recent Work==
===Finding Company URLs===
See http://mcnair.bakerinstitute.org/wiki/URL_Finder_(Tool)#Summer_2018_URL_Finder_work for details.

Excel master datasets are in:
E:\McNair\Projects\Accelerators\Summer 2018

Code and files specific to this URL finder are in:
E:\McNair\Projects\Accelerators\Summer 2018\url finder

====Results====
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.

====Testing====
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):
E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx

we filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs. Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.

To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as a search term and taking the first 4 valid results (see the don't collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. The matcher also misidentifies some URLs that look really similar to the company name, but it is accurate for the most part if the name is not too generic.

I then tried to run the 20 unfound company names through the crawler again, but this time I used company name + startup as the search term. This did not identify any more correct URLs.

It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.

====Actual Run Info====
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'. The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.

The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'. Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company.

====Using Python files====
'''To use STEP1_crawl.py''':
# Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search.
# Change NUMRESULT to be however many results you would like from Google.
# Adjust DONT_COLLECT to include any websites that you don't want.
# If you would like to add another search keyword, add it in line 87, which is queries.append(name + "whatever you want here").
# Change line 127 to be the name of your output file.

===Seed DB Parser===
See [[Seed DB Parser]] for information on functionality.

The results from crawling Seed DB gave us more information for 257 companies. This is located in (sheet: final):
E:\McNair\Projects\Seed DB\merging work.xlsx
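
For reference, the URL-matching step described under Finding Company URLs (keeping only URLs with a match score greater than 0.9 and taking the highest-scoring URL per company) could be sketched roughly as below. This page does not describe how STEP2_findcorrecturl.py computes its score, so the similarity measure here (Python's difflib applied to the company name and the first label of the domain) is an assumption used as a stand-in, and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def domain_label(url):
    """Return the first host label after an optional "www.", e.g. "acmelabs"."""
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    host = host.split(":")[0]  # drop any port
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]  # simplification: subdomains are not handled


def match_score(company, url):
    """Similarity in [0, 1] between the company name and the URL's domain label."""
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, domain_label(url)).ratio()


def best_urls(candidates, threshold=0.9):
    """Keep, per company, the highest-scoring URL if its score exceeds threshold.

    candidates maps a company name to the list of URLs found for it.
    """
    results = {}
    for company, urls in candidates.items():
        scored = [(match_score(company, u), u) for u in urls]
        if not scored:
            continue
        score, url = max(scored)
        if score > threshold:
            results[company] = url
    return results


# Made-up example input, not taken from ACTUAL_crawled_company_urls.txt.
crawled = {
    "Acme Labs": ["https://www.linkedin.com/company/acme-labs", "https://www.acmelabs.com"],
    "Generic": ["https://www.example.org"],
}
print(best_urls(crawled))
```

Lowering the threshold argument corresponds to the option mentioned above of relaxing the 0.9 cutoff and then keeping only the best-scoring URL for each company.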
==An Overview==
