Difference between revisions of "U.S. Seed Accelerators"

From edegan.com

Revision as of 15:09, 19 July 2018

McNair Project
U.S. Seed Accelerators
Project logo 02.png
Project Information
Project Title U.S. Seed Accelerators
Owner Connor Rothschild
Start Date 06/18/2018
Keywords accelerators, data
Primary Billing
Notes [[Has notes::Continuation of Accelerator Data]]
Has project status Active
Is dependent on Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data
Subsumes: Accelerator Data, Accelerator Seed List (Data)
Copyright © 2016 edegan.com. All Rights Reserved.

Project Location

The master file can be found at

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx

Relevant Former Projects

This page serves as an updated and tidied version of the data and work presented on the Accelerator Seed List (Data) Project, which subsumed Accelerator Data. Both of these projects (and as a corollary, this project) are dependent on the Demo Day Page Parser, Industry Classifier, and the Whois Parser.

7/9/18 Update

Here's a project update on the work that has been done since coming to McNair.

The Equity Variable: COMPLETE

Maxine Tao and I have added five new variables to the Accelerator Master Variable List - Revised by Ed V2 file. Those variables are:

  • Terms of joining - terms of joining accelerator and important details about program
  • Equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if it definitively does not, and are blank if we could not find that information
  • Equity Amount - the % of equity the accelerator will take (sometimes a range, e.g. 5-7%)
  • Investment - the $ amount the accelerator initially invests in a company, if relevant (may also be a range or an "up to $######")
  • Notes - any comments on the previous 4 columns

These five variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.

Relevant information:

  • 82 accelerators take equity, 42 do not, and we lack information for 37.
  • The average % of equity among accelerators that take equity is 6.49% (a rough estimate; do not use for anything official). We got this number by looking only at accelerators that take equity, coding each reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.
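The averaging rule described above (midpoint for ranges, then a mean over accelerators with a reported amount) can be sketched roughly as follows. This is only an illustration: the function names and the sample cells are hypothetical, and the real spreadsheet cells may need more cleaning than this simple number extraction handles.

```python
import re
from statistics import mean


def equity_midpoint(cell):
    """Return the equity % in a cell, using the midpoint if a range is given.

    Hypothetical helper: assumes cells look like "6%", "5-7%" or "4%-10%".
    Returns None for blank cells or cells with no number.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(cell))]
    if not numbers:
        return None
    bounds = numbers[:2]  # one number = fixed amount, two = low/high of a range
    return sum(bounds) / len(bounds)


def average_equity(cells):
    """Mean equity % over the cells that report an amount."""
    values = [v for v in (equity_midpoint(c) for c in cells) if v is not None]
    return mean(values) if values else None


# Made-up example cells, not real data from the master file.
sample = ["6%", "4%-10%", "", "5-7%"]
print(average_equity(sample))
```

Blank cells are skipped rather than counted as zero, which matches the idea of averaging only over accelerators that take equity.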

Matching Accelerators to UUIDs: COMPLETE

The file with accelerators matched to Crunchbase UUIDs can be found at:

/bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx

This is the master file and should never be modified unless we find that a UUID has changed. ALL OTHER SHEETS with UUIDs are linked to this sheet, so any changes to it will be reflected elsewhere.

More information can be found on the Crunchbase Data page.

Linking Accelerators to Founders/LinkedIn Crawling: COMPLETE

Grace Tan got the LinkedIn Crawler (Python) to work, which means we currently have the following information about accelerator founders:

  • Current Job Title
  • Location
  • Employer
  • Job(s) Title
  • Dates Employed
  • Time Employed
  • Location of jobs
  • Extra Description
  • School Name
  • Degree Name
  • Major
  • Attended
  • Graduated
  • Societies

An Overview

This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.

Helpful Links: http://seedrankings.com/

This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.

The primary use of this data is for an academic paper detailed on the Matching Entrepreneurs to Accelerators and VCs (Academic Paper) page.

However, this project can also provide useful data to other academic papers (Urban Start-up Agglomeration, Hubs (Academic Paper), and Hubs Scorecard (Academic Paper)), projects (Houston Entrepreneurship) and blog posts (under the Emerging Ecosystems umbrella project).

(OUTDATED) The most recent update provided on Accelerator Seed List (Data) was on 05/21/2018. This update included the most recent master file of accelerator data, found at

E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx

(OUTDATED) The Google Sheets Master Sheet is found here


Remaining To Dos

The last update on Accelerator Seed List (Data) said the following needed to be done:

  • Cross-reference sheet with data from Peter's old accelerator consolidation file ("accelerator_data_noflag" and "accelerator_data" in "All Relevant Files") and fill in missing data
  • Variables that are 100% NOT in these 2 files:
    • Cohort Breakout?
    • Subtype
    • Designed for Students?
    • Campuses
    • Stage
    • Software Tech
    • What stage do they look for?


McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt

A 0 means we don't have founder data for that accelerator. Specs: A tab delimited text file with the following fields:

Accelerator   First Name   Last Name   LinkedInURL(if possible)

Providing the LinkedInURL will ensure accuracy, but it is optional.

  • Shrey: Find "demo day" keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages

It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.
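The tab-delimited founder file specified above (Accelerator, First Name, Last Name, optional LinkedInURL) could be read along these lines. This is a minimal sketch: the dictionary keys and the sample rows are made up for illustration, and the real file may or may not include a header row.

```python
import csv
import io

# Hypothetical key names for the four fields in the spec.
FIELDS = ["accelerator", "first_name", "last_name", "linkedin_url"]


def read_founders(handle):
    """Parse tab-delimited founder rows; the LinkedIn URL column may be absent."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        # Skip empty or whitespace-only lines.
        if not record or not any(field.strip() for field in record):
            continue
        # Pad short rows so a missing URL becomes an empty string.
        padded = (record + [""] * len(FIELDS))[: len(FIELDS)]
        rows.append(dict(zip(FIELDS, (f.strip() for f in padded))))
    return rows


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "Example Accelerator\tJane\tDoe\thttps://www.linkedin.com/in/example\n"
    "Other Accelerator\tJohn\tRoe\n"
)
founders = read_founders(sample)
print(founders)
```

To use it on an actual file, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` sample.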

Other Listed To Dos

  • We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the "Accelerators" folder under "Data" or on the "Accelerator Master Variable List" Google sheet.
  • We have listed all of the startups from the accelerators that have break out cohorts on their website on the "Accelerator Master Variable List" Google sheet. This contains the following information in the "Cohort List (new)" sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location.
  • Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see Demo Day Page Google Classifier).

Moving Forward

Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):

Step Zero: Connect to Crunchbase and Link Data - COMPLETE

Crunchbase Data

Step One: LinkedIn Founders Data

This project will begin by working with Grace Tan and Maxine Tao to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through Crunchbase and find the UUID for companies and their founders (reference Crunchbase Data, Crunchbase Accelerator Founders, Crunchbase Accelerator Equity). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by Grace Tan).

The list of founders for accelerators can be found at

McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt

The Unfound Founders file codes a 0 for all companies not listed within the LinkedIn Founders file, and a 1 for those that do have founders listed.

Given the founders' names, we will then be able to use the LinkedIn Crawler (Python) to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us answer the horse, jockey, racetrack question: which variables affect a startup's success (the accelerator, the founders, or the environment/city).

Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase

In this step we focus on accelerators that take equity from the companies in their program. We do this to avoid looking at accelerators that may also run funds or invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not.

Maxine will acquire the list of accelerators who take equity from companies from the following sheet:

E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt

Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.

This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows.

We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.
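A rough sketch of that left join, using Python's built-in sqlite3 module so the SQL can be tried without a database server. The table names, column names, and sample rows are hypothetical, and matching on lower-cased, trimmed names is an assumption about how the two files line up.

```python
import sqlite3


def left_join_equity(master_names, noflag_rows):
    """Keep every master-list accelerator; attach equity from noflag if present.

    master_names: list of accelerator names from the master list.
    noflag_rows: list of (name, equity) tuples from accelerator_data_noflag.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE master (name TEXT)")
    conn.execute("CREATE TABLE noflag (name TEXT, equity TEXT)")
    conn.executemany("INSERT INTO master VALUES (?)", [(n,) for n in master_names])
    conn.executemany("INSERT INTO noflag VALUES (?, ?)", noflag_rows)
    rows = conn.execute(
        "SELECT m.name, n.equity FROM master AS m "
        "LEFT JOIN noflag AS n ON lower(trim(m.name)) = lower(trim(n.name)) "
        "ORDER BY m.name"
    ).fetchall()
    conn.close()
    return rows


# Made-up rows: "Gamma Hub" is only in noflag, so the join drops it.
master = ["Alpha Accelerator", "Beta Labs"]
noflag = [("alpha accelerator", "Y"), ("Gamma Hub", "N")]
joined = left_join_equity(master, noflag)
print(joined)
```

Accelerators in the master list with no match keep a NULL (None) equity value, which makes the remaining gaps easy to find.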

Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do our Crunchbase work on to see when accelerators take equity.

We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.

From this, we get the following data:

  • Accelerator a given company went through
  • Year said company went through a cohort/Specific cohort company went through
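The cross-referencing step could look something like the sketch below. It assumes investment records are available as (investor, company, date) tuples, which is a simplification of the Crunchbase data, and all names in the example are invented.

```python
from datetime import date


def match_cohorts(investments, equity_accelerators):
    """Return one record per investment made by an equity-taking accelerator.

    investments: iterable of (investor_name, company_name, investment_date).
    equity_accelerators: names of accelerators coded as taking equity.
    """
    wanted = {name.strip().lower() for name in equity_accelerators}
    matches = []
    for investor, company, invested_on in investments:
        if investor.strip().lower() in wanted:
            matches.append(
                {
                    "accelerator": investor,
                    "company": company,
                    # Assumption: the investment year approximates the cohort year.
                    "year": invested_on.year,
                }
            )
    return matches


# Made-up records for illustration only.
investments = [
    ("Alpha Accelerator", "Widget Co", date(2017, 6, 1)),
    ("Some VC Fund", "Widget Co", date(2018, 2, 15)),
]
cohorts = match_cohorts(investments, ["alpha accelerator"])
print(cohorts)
```

In practice the match would be done on Crunchbase UUIDs rather than names where possible, since names are less reliable.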

Step Three: Demo Day Crawler

This part of the project relies on the contributions of the wonderful Minh Le. Better documentation for the project can be found on the Demo Day Page Parser, Demo Day Page Google Classifier, and Accelerator Demo Day project pages.

Essentially, this part of the accelerator data project will use the Demo Day Page Parser to look through accelerator websites for pages which list a cohort's 'Demo Day', the day on which accelerators present their companies to a group of special investors (here's an example FAQ page from Y Combinator). The Demo Day Page Google Classifier will then determine if the page is, in fact, a demo day page.

Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):

  • The date a cohort began/the season the cohort went through the accelerator
    • This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:
E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx
  • I assume we can also acquire the companies in a specific cohort; that is, we will have the list of all companies in, for example, the Spring 2018 cohort of Techstars.
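The start-date calculation described above (demo day date minus program length) is simple date arithmetic. A minimal sketch follows; the month-to-season mapping is an assumption made for illustration, not a project convention, and the dates are invented.

```python
from datetime import date, timedelta


def cohort_start(demo_day, program_weeks):
    """Estimate when a cohort began from its demo day and program length."""
    return demo_day - timedelta(weeks=program_weeks)


def season(day):
    """Map a date to a season label (assumed mapping: Mar-May = Spring, etc.)."""
    if 3 <= day.month <= 5:
        return "Spring"
    if 6 <= day.month <= 8:
        return "Summer"
    if 9 <= day.month <= 11:
        return "Fall"
    return "Winter"


# Hypothetical cohort: demo day on 24 May 2018 after a 12-week program.
start = cohort_start(date(2018, 5, 24), 12)
print(start, season(start))  # 2018-03-01 Spring
```

Program lengths recorded in months would first need converting to weeks or days, since `timedelta` has no month unit.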

Step Four: Non-profit Finder

More at Non-profit Finder

Another important step in this project is finding which accelerators are non-profits.

A comprehensive list of nonprofits taken from the IRS can be found here:

E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx

Warning: this file has 1 million rows

This file should be cross-referenced with the list of accelerators; any accelerator that appears in it is a non-profit. We'll need someone good at SQL to do the match.

Potential problem:

  • The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.
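One possible way to soften the naming problem is approximate string matching after stripping common legal suffixes. The sketch below uses Python's standard difflib module; the suffix list, the 0.85 cutoff, and the sample names are all assumptions, and any matches it proposes would still need checking by hand.

```python
import difflib
import re

# Assumed list of words to ignore when comparing names.
SUFFIXES = {"inc", "llc", "corp", "corporation", "foundation", "the", "co"}


def normalize(name):
    """Lower-case, drop punctuation, and remove common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def match_nonprofits(accelerators, nonprofits, cutoff=0.85):
    """Map each accelerator to its closest nonprofit name, if similar enough."""
    index = {}
    for nonprofit in nonprofits:
        # Keep the first original spelling seen for each normalized name.
        index.setdefault(normalize(nonprofit), nonprofit)
    keys = list(index)
    matches = {}
    for accelerator in accelerators:
        close = difflib.get_close_matches(
            normalize(accelerator), keys, n=1, cutoff=cutoff
        )
        if close:
            matches[accelerator] = index[close[0]]
    return matches


# Made-up names for illustration only.
accelerators = ["Example Accelerator", "Unrelated Labs"]
nonprofits = ["THE EXAMPLE ACCELERATOR INC", "City Food Bank"]
found = match_nonprofits(accelerators, nonprofits)
print(found)
```

Against the full file of about a million rows this brute-force comparison would be slow, so it may be worth narrowing the nonprofit list first (for example by state or by an exact match on the normalized name) before falling back to fuzzy matching.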

Workflow Image

(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project: