Changes

9,690 bytes added , 12:31, 6 October 2020


{{Project|Has project output=Data,Tool|Has sponsor=McNair Center

|Has title=U.S. Seed Accelerators

|Has owner=Connor Rothschild,

|Does subsume=Accelerator Data, Accelerator Seed List (Data),

}}

<onlyinclude>The [[U.S. Seed Accelerators]] project subsumes several related projects. These projects were intended to assemble near-population data on high-growth high-tech seed accelerators in the U.S. and understand how to automate the data collection process. As such, the project includes both a dataset and prototypes. Some of the prototypes were used in the [[Kauffman Incubator Project]].</onlyinclude>

==Project Location==

The master file can be found at

/bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''

Note that TFTRTA-AcceleratorFinal.txt in E:\projects\accelerators was updated to include all creation dates and dead dates.

==Relevant Former Projects==

This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].

Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].

==Update for Hira==

===Final MTurk Push===

Minh and I pushed a final batch of HITs to MTurk. We found that, even after MTurk, our data was still missing timing info for around 1000 companies. Upon further inspection, we realized that around 800 of these companies belonged to only ~10 accelerators. We think the problem was that Google returns the most recent results first, so we missed old cohorts for large accelerators. We therefore re-ran Minh's crawler on these accelerators with different year parameters.

We got 650 results. Upon pushing these to MTurk, we got good results for 144 of them. This number is what remained after filtering out results with no companies listed, no date listed, or no accelerator listed (after searching manually). We also removed duplicates and accelerators we do not care about. The 144 results collectively list 1,538 companies.

This file can be found here:

/bulk/McNair/Projects/Accelerators/Summer 2018/Final Turk Push.xlsx

The next step is to plug this sheet into Grace's Python script, which takes these companies and converts each company to its own row, so that it can be merged with our other data.

===Manual Searching===

For the other 170 companies we lacked timing info for (which were not worth crawling for because there were few companies assigned to each accelerator), McNair Center interns manually searched for timing info. Of the 170 companies we searched for, we found timing information for 128.

The sheet can be found here:

https://docs.google.com/spreadsheets/d/1hGgxNwLph0tWtqO_8bNUGM-kzVXTeb-N26ojwL3TTuk/edit?usp=sharing

It is ready to merge in with our existing data.

===Recoded Founders' Experience===

I have updated and reclassified founders' job titles.
We began with 451 unique job titles, and were able to condense them into 16 categories, including:
*Academic
*Advisor
*Board
*C-level
*CEO
*Director
*Founder
*Investment
*Management
*Marketing
*Partner
*President
*VP
*Other

The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:

https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing

This has been merged into The File to Rule Them All.

===Recoded Stage===

I have updated and cleaned up the "what stage accs look for companies in" variable by splitting it up into three categories:
*seed
*early stage venture
*late

Other classifications were collapsed into these three or were too rare (n<2) to be coded as their own classification.

The Google Sheet used to recode the stage variable can be found here:

https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing

It has been merged into The File To Rule Them All.

===Recoded Dead Accelerators===

We have updated dead accelerators on the following Google Sheet:

https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing

This has been merged into The File To Rule Them All.

===Recoded Equity/Investment===

The Google Sheet with this work is here:

https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing

It has also been merged into The File To Rule Them All.

I have updated equity using data from https://www.seed-db.com/accelerators. I have also updated the columns with a new normalized version of investment. Within The File To Rule Them All, you will find two normalizations of investment:
*the midpoint normalization, which uses the midpoint of each accelerator's investment range.
*the upper bound normalization, which uses the highest amount each accelerator will invest.
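The two normalizations listed above can be sketched in Python. This is only an illustration under stated assumptions, not the formulas used in the sheet: the function names, the input strings other than Boost VC's stated range, and the treatment of "Up to $X" (lower bound taken as the hypothetical `up_to_floor` parameter, defaulting to 0) are all assumptions, since this page does not say how those cases were handled.

```python
import re
from statistics import mean


def parse_amounts(text):
    """Return every dollar amount in a string such as '$50,000 - $500,000' as ints."""
    return [int(m.replace(",", "")) for m in re.findall(r"\$\s*(\d[\d,]*)", text)]


def normalize_investment(text, up_to_floor=0):
    """Return (midpoint, upper_bound) for one accelerator's stated investment.

    Assumption: 'Up to $X' is treated as the range up_to_floor..X;
    a single plain amount is used for both values.
    """
    amounts = parse_amounts(text)
    if not amounts:
        return None, None
    high = max(amounts)
    if len(amounts) == 1:
        low = up_to_floor if "up to" in text.lower() else high
    else:
        low = min(amounts)
    return (low + high) / 2, high


def average_investments(stated):
    """Average the midpoint and upper-bound normalizations across accelerators."""
    pairs = [normalize_investment(s) for s in stated.values()]
    mids = [m for m, _ in pairs if m is not None]
    uppers = [u for _, u in pairs if u is not None]
    return mean(mids), mean(uppers)


# Only the Boost VC range comes from this page; the other rows are made up.
stated = {
    "Boost VC": "between $50,000 and $500,000",
    "Hypothetical A": "Up to $20,000",
    "Hypothetical B": "$25,000",
}
print(average_investments(stated))

# Sensitivity check: recompute both averages without the suspected outlier.
without_outlier = {k: v for k, v in stated.items() if k != "Boost VC"}
print(average_investments(without_outlier))
```

Running the same averages with and without one accelerator makes it easy to see how much a single wide range moves each normalization.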
This dual normalization was performed because many accelerators say they invest "Up to $__,___", so a midpoint may not accurately reflect actual investment amounts. The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.

'''NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range and the upper limit of $500,000 may throw off our analysis. By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''

===Amazon Mechanical Turk Pricing===

Information can be found here: [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]

===Recoded Founders Education===

I have recoded two components of the founders' education sheet:

1) Degree name has been reclassified into nine categories:
*High School
*Associates
*Bachelors
*Masters
*Certificate
*JD
*MBA
*PhD
*Other

2) Majors have also been recoded into nine categories:
*H = Humanities
*SS = Social Sciences
*NS = Natural Sciences
*E = Engineering (includes computer science)
*B = Business and Economics
*L = Leadership
*MBA
*JD
*O = Other

The Google Sheet I used to reclassify can be found here:

https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing

The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process/formulas of reclassifying, and '''Updated Info''' contains only good, updated data.

The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).
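As a rough illustration of the degree recoding described above, the sketch below collapses free-text degree names into the nine degree categories. The actual formulas live in the '''WORK''' sheet and are not reproduced here; the names `DEGREE_RULES` and `recode_degree` and every keyword pattern are hypothetical, so real data would likely need more rules.

```python
import re

# Ordered (pattern, category) rules; the first match wins, so specific
# degrees (PhD, MBA, JD) are checked before the generic "master" rule.
# The patterns are illustrative guesses, not the sheet's actual formulas.
DEGREE_RULES = [
    (r"\bphd\b|doctor of philosophy", "PhD"),
    (r"\bmba\b|master of business administration", "MBA"),
    (r"\bjd\b|juris doctor", "JD"),
    (r"\b(ms|ma|msc|meng)\b|master", "Masters"),
    (r"\b(bs|ba|bsc|beng)\b|bachelor", "Bachelors"),
    (r"\b(aa|aas)\b|associate", "Associates"),
    (r"certificat", "Certificate"),
    (r"high school", "High School"),
]


def recode_degree(raw):
    """Map a free-text degree name to one of the nine categories."""
    # Lowercase and drop periods so "Ph.D." and "PhD" are matched alike.
    text = raw.lower().replace(".", "")
    for pattern, category in DEGREE_RULES:
        if re.search(pattern, text):
            return category
    return "Other"


for raw in ["Ph.D. in Physics", "M.S. Computer Science", "B.S.", "Nanodegree"]:
    print(raw, "->", recode_degree(raw))
```

Keeping the rules in an ordered list makes it straightforward to add or reorder patterns when a title is misclassified.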
===Recoded multiple campuses and cohorts===

The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.

The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:

https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing

We have also added a new cohort list.

Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.

===Fixed Manual Data from Google Sheet===

We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called "Good Data Only", at the same link:

https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing

I first used our "recap" and "announced" classification to standardize and fix the dates.
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.
*Columns P and Q are the Month and Year stripped from Column O.
*Finally, Column R is the season variable, as Ed said it should be coded.

We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.
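The Column O to Q logic described in this section can be sketched in Python. This is a sketch under assumptions rather than the sheet's formulas: the function names and the example date are hypothetical, and it assumes that dates from "announced" pages are kept unchanged. The season variable is left out because this page does not define how it is coded.

```python
from datetime import date, timedelta


def actual_start_date(listed_date, page_type, program_weeks):
    """Column O: back out the program start date from the listed page date.

    A "recap" page is published at the end of a program, so the program
    length is subtracted. Assumption: any other page type (for example
    "announced") already carries the date we want.
    """
    if page_type.strip().lower() == "recap":
        return listed_date - timedelta(weeks=program_weeks)
    return listed_date


def month_and_year(d):
    """Columns P and Q: the month and year stripped from the actual date."""
    return d.month, d.year


# Hypothetical row: a recap page dated 15 June 2018 for a 12-week program.
start = actual_start_date(date(2018, 6, 15), "recap", 12)
print(start, month_and_year(start))
```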
===Recoded employee count===

I have added a new column in Cohorts Final (of The File to Rule Them All) but left the old column in case you would prefer to edit/classify differently. Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge). The employee count column is standardized and can easily be edited given some modification of the Excel formula.

==Recent Work==

Here's a project update on the work that has been done since coming to McNair.

The most recent file is

/bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''

===Merging Cohort Companies with Crunchbase Info===

More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].

The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.

Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies.
The following new information (on top of what we already had) is included in the sheet:
*short_description
*long_description
*category_list (details the company's category)
*category_group_list (a less refined, more all-encompassing category classification)
*founded_on date
*employee_count
*linkedin_url
*address

The following information was also pulled from Crunchbase and '''merged''' with our existing data:
*URL (was merged with courl cells)
*city (was merged with colocation)
*state_code (was merged with colocation)
*country_code (was merged with colocation)
*status (was merged with costatus)

===The Equity Variables===

[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:
*Terms of joining - terms of joining accelerator and important details about program
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)
*Equity Amount Normalized - this copies the previous column but only keeps %>0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or an "up to $######")
*Investment Notes - anything to comment on previous 4 columns

These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.

Relevant information:
*82 accelerators take equity, 42 do not, and we lack information for 37.
*The average % of equity among accelerators who take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, using the average for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.

===Matching Accelerators to UUIDs via Crunchbase===

We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.

The file with accelerators matched to Crunchbase UUIDs can be found at:

/bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx

This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.

More information can be found on the [[Crunchbase Data]] page.

===Linking Accelerators to Founders/LinkedIn Crawling===

[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:
*Current Job Title
*Location
*Employer
*Job(s) Title
*Dates Employed
*Time Employed
*Location of jobs
*Extra Description
*School Name
*Degree Name
*Major
*Attended
*Graduated
*Societies

This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.

===Finding Company URLs===

See http://mcnair.bakerinstitute.org/wiki/URL_Finder_(Tool)#Summer_2018_URL_Finder_work for more details.

===Seed DB Parser===

See [[Seed DB Parser]] for information on functionality.

The results from crawling Seed DB gave us more information for 257 companies. This is located in (sheet: final):

E:\McNair\Projects\Seed DB\merging work.xlsx

This project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]), and blog posts.

==An Overview==

This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.

Helpful Links: http://seedrankings.com/

==Remaining To Dos==

**What stage do they look for?

==(Outdated) Necessary Steps==

*Shrey: Find "demo day" keywords, so that we can search AcceleratorName Year Keyword

'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''

===Step Zero: Connect to Crunchbase and Link Data===

Complete - more info: see [[Crunchbase Data]].

*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the "Accelerators" folder under "Data" or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 "Accelerator Master Variable List" Google sheet].
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 "Accelerator Master Variable List" Google sheet].
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible.

===Step One: LinkedIn Founders Data===

Maxine will acquire the list of accelerators who take equity from companies from the following sheet:

E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt

Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.

A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:

E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx

Warning: this file has 1 million rows

Ed

