{{Project|Has project output=Data,Tool|Has sponsor=McNair Center|Has title=Accelerator Seed List (Data)|Topic Area=Entrepreneurship Ecosystems|Has owner=Shrey Agarwal, Matthew Ringheanu, Veeral Shah, Connor Rothschild|Has start date=Fall 2016|Has keywords=Accelerators,Data|Has project status=Subsume|Is dependent on=Industry Classifier
}}
=Current Work=
 
===As of 05/21/2018 the Google Sheet workbook has been downloaded to the E drive. It is now an Excel workbook saved at E:\McNair\Projects\Accelerators\Summer 2018\Accelerator Master Variable List.xlsx. This is now the master file.===
 
Google Master Sheet: https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0
*Cross-reference sheet with data from Peter's old accelerator consolidation file ("accelerator_data_noflag" and "accelerator_data" in "All Relevant Files") and fill in missing data
*Variables that are 100% NOT in these 2 files:
**Cohort Breakout?
**Subtype
**Designed for Students?
**Campuses
**Stage
**Software Tech
**What stage do they look for?
 
TODO:
McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt
A 0 means we don't have founder data for that accelerator.
Specs: A tab-delimited text file with the following fields:
Accelerator, First Name, Last Name, LinkedInURL (if possible)
Including the LinkedInURL will ensure accuracy, but the file will work without it.
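
A minimal sketch of reading a file in the format above, assuming one record per line, no header row, and an optional fourth column. The function name read_founders, the dictionary keys, and the sample rows are illustrative only and are not part of the project code.

```python
import csv
import io

# Column order follows the spec above; the key names are hypothetical.
FIELDS = ["accelerator", "first_name", "last_name", "linkedin_url"]

def read_founders(handle):
    """Parse tab-delimited founder rows; the LinkedInURL column may be missing."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not any(cell.strip() for cell in record):
            continue  # skip blank lines
        # Pad short rows so a missing LinkedInURL becomes an empty string.
        padded = [cell.strip() for cell in record] + [""] * (len(FIELDS) - len(record))
        rows.append(dict(zip(FIELDS, padded[:len(FIELDS)])))
    return rows

# Made-up rows for illustration; real data lives in unfound_founders.txt.
sample = io.StringIO(
    "Techstars\tJane\tDoe\thttps://www.linkedin.com/in/janedoe\n"
    "OwlSpark\tJohn\tSmith\n"
)
founders = read_founders(sample)
```

Rows without a LinkedInURL are kept with an empty string in that field rather than dropped, which matches the note that the file works without it.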
 
 
*Shrey: Find "demo day" keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages
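
The "AcceleratorName Year Keyword" search pattern could be generated along these lines. This is only a sketch: the function name demo_day_queries is hypothetical, and the keywords shown are placeholders rather than the keyword list the project is looking for.

```python
from itertools import product

def demo_day_queries(accelerators, years, keywords):
    """Build 'AcceleratorName Year Keyword' search strings for every combination."""
    return [
        f"{name} {year} {keyword}"
        for name, year, keyword in product(accelerators, years, keywords)
    ]

# Placeholder keywords; the real list still has to be found.
queries = demo_day_queries(["Techstars"], [2016, 2017], ["demo day", "pitch day"])
```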
 
 
==Accelerator Type project==
 
The file to edit is called "Accelerator type list". It is located in the folder E:\McNair\Projects\Accelerators\Spring 2018\Grouping project of ListOfAccs. More systematic information and instructions are in "Instructions for Accelerator type project" in the same folder.
 
NOTE: until we get through all 270 accelerators, we will just categorize each accelerator into the following three categories as quickly as possible, with short notes in the "other info" column; once we have this, we will go back through the ones that aren't categorized and add notes to the "other info" column.
 
 
Type list:
*Private
*Corporate
*Academic
Note: if DEAD, noted here.
 
 
Other info:
*nonprofit? (y/n)
 
*Subtype abbreviations:
**S: for if a social entrepreneurship initiative
**I: for if an incubator
**A: for an angel group
**F: for foreign
**C: for in coworking space/hub/etc
**V: for if part of venture fund
**G: for if government funded/partnered
**T: for international
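
For anyone processing the type list programmatically, the abbreviations above can be held in a lookup table. The sketch below assumes that several codes in one cell are separated by commas; the function name expand_subtypes is hypothetical.

```python
# Codes and meanings copied from the subtype abbreviation list above.
SUBTYPES = {
    "S": "social entrepreneurship initiative",
    "I": "incubator",
    "A": "angel group",
    "F": "foreign",
    "C": "coworking space/hub",
    "V": "part of venture fund",
    "G": "government funded/partnered",
    "T": "international",
}

def expand_subtypes(codes):
    """Turn a comma-separated code string such as 'S,G' into descriptions."""
    labels = []
    for code in codes.split(","):
        code = code.strip().upper()
        if not code:
            continue  # empty cell or trailing comma
        if code not in SUBTYPES:
            raise ValueError(f"unknown subtype code: {code!r}")
        labels.append(SUBTYPES[code])
    return labels
```

Raising on an unknown code, instead of silently skipping it, makes typos in the "other info" column visible early.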
 
 
Note: subtypes (from individual text files in E:\McNair\Projects\Accelerators\Spring 2017\Code+Final_Data) were only found for 23 of the 270 accelerators. These accelerators were initially intended to be removed from the master list. Remaining subtypes are currently being added.
 
other info:
 
international offices, founders, industries, org type, program duration, or other interesting, easily accessed variables. Additional information is especially important for accelerators that have no other subtype abbreviation listed.
 
 
===Steps to research an accelerator===
 
1. Copy/paste the URL listed in the Accelerator type list file into Google. If the website is insufficient, try googling:
*the name of the accelerator
*the name of the accelerator + "crunchbase"
*the name of the accelerator + "nonprofit"
 
The above searches sometimes lead to other helpful databases/news articles.
 
2. Note whether:
1) Academic/Corporate/Private
2) For Profit/Nonprofit. Sometimes this isn't directly stated but can be inferred from their description of, say, their investment process. If they don't address this at all, it's probably For Profit.
3) subtype (S, I, A, F, C, V, G, T).
4) Additional, easily-accessed info. Number 4 is really important if there's no subtype.
 
All 270 need to be done by the end of the semester.
 
 
Type list file saved as
"Accelerator type list" in E:\McNair\Projects\Accelerators\Spring 2017\Grouping project of ListOfAccs.
ListofAccs, the list from which we drew the Accelerator type list, should have no matches with any of the flagged accelerators in E:\McNair\Projects\Accelerators\Spring 2017\Code+Final_Data. There are 23 matches, though, so all subtypes must be searched and entered manually. Nonprofit status for some accelerators is recorded in the file called "whether nonprofit..." in E:\McNair\Projects\Accelerators\Spring 2017\Grouping project of ListOfAccs. Accelerators with no nonprofit information there need to have it entered manually.
 
=Funded By Accelerators=
 
Reference the like-named portion in [[Crunchbase Data#Funded by Accelerators|Crunchbase Data]]
 
=End of Semester Report=
The end of semester report will focus on ranking accelerators and environments based on the variables we have gathered. Our primary form of categorization will be ranking individual accelerators based on their venture capital raise rate. We can probably generate information over time for accelerators and the amount of VC they raised to get a sense of what locations have developed in the past five years from the dates of transactions recorded by SDC. To obtain these rankings, we will identify which cohorts companies were trained in, as well as complete details of the accelerator and the details of cohort companies. We will focus only on accelerators because there are many other entities in each ecosystem. We will also utilize information on IPO or acquisition by companies, obtained through Crunchbase, to gain some sense of how successful startups emerging from a particular accelerator are. To obtain the data over time, we will need to fill out the cohort date information column in our cohort data, which will require the help of either Crunchbase or the Wayback machine for older accelerators. In ranking the accelerators across regions, we can also track industry-specific hotspots for accelerators such as medicine in Memphis or technology in San Francisco.
 
To complete the report, we need to fill information in:
*Industry and focus
*Location
*Name, description
*Matched VC data
*Founder information (maybe)
 
=Overview=
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.
 
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page.
 
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).
 
This project needs the results of the [[Industry Classifier]], [[Whois Parser]], and other tools.
 
=Current Project Write-Up=
 
==Things To Do==
*Obtain all URLs for accelerators in order to run through the Wayback Machine to find out when they started.
*Match Crunchbase Data with our Accelerator List to see if they have any accelerators that we do not.
*Obtain an example of accelerator that started early and has multiple companies but does not separate them into cohorts and figure out a way to determine which companies went through each cohort.
 
==What Each File in the "Accelerator" Folder on the RDP Contains==
*"Accelerator List Sources" (Folder) - This folder contains most of the sources that we pulled accelerator names from at the very beginning of the project.
*"Code+Final_Data" (Folder) - This folder contains Peter's code for pulling the data from the text files in the "Data" folder.
*"Crunchbase Snapshot" (Folder) - This folder contains the data we obtained from Crunchbase. There is a massive amount of data which we will need to sort through to find useful information and hopefully match that data with our current cohort data.
*"Data" (Folder) - This folder contains all of our data on accelerators including cohort information and the html files of each cohort page. I would estimate that it is about 95% clean currently.
*"Data - Copy" (Folder) - This is just a copy of our current "Data" folder.
*"Data_Copy" (Folder) - This is a copy of our original "Data" folder before we did any manual cleaning.
*"Enclosing_Circle" (Folder) - This folder seems to contain some data on VC but I'm not sure how it pertains to the Accelerator project.
*"F6S Accelerator HTMLs" (Folder) - This folder contains the HTML pages of all the pages on the F6S website. We used it to add more potential accelerators to our list.
*"Google_SiteSearch" (Folder) - This folder contains Python code for Google searches.
*"Industry_Classifier" (Folder) - This folder seems to contain Python code but I'm not sure what for.
*"Matcher" (Folder) - This folder contains the Matcher.
*"Python WebCrawler" (Folder) - This folder contains code that is a work in progress for pulling descriptions from accelerator websites. It is Jeemin's project.
*"Cleaned Cohort Data Copy" (Excel File) - This file contains a copy of our cleaned cohort data.
*"Cleaned Cohort Data" (Excel File) - This file contains the most current, completely cleaned data on cohort company information.
*"NormalizeFixedWidth" (PL File) - This is the normalizer.
*"PortCoNames" (TXT File) - This file contains all of the names of the cohort companies as well as the accelerator they went through.
*"VC Data" (Excel File) - This file contains all of the names of the companies that have ever received VC funding.
*"VC_Data" (TXT File) - This file contains the non-normalized data of all of the VC information.
*"VC_Data_Names" (TXT File) - This file contains all of the names of companies that have received VC funding.
*"VC_Data_Names_Matched_PortCoNames" (Excel File) - This file contains all of the cohort companies that have also received VC funding. Still needs to be sorted through.
 
==Process==
After accumulating the massive amount of data on accelerators, their cohorts, and their HTML files, we began cleaning the text files, which are located in the "Data" folder within "Accelerators". After the first round of cleaning, we ran a script over the cohort data that put all of the information into an Excel document called "Cleaned Cohort Data". Unfortunately, there were still some mistakes in the cohort information, which we fixed within the Excel file itself. As a result, some text files within the "Data" folder do not match the "Cleaned Cohort Data" file, and rerunning the cohort script over the "Data" folder would produce output that does not match it, which is problematic. The solution (other than manually cleaning the text files again) would be to write a script that uses the "Cleaned Cohort Data" file to correct the text files in the "Data" folder. We have also matched all of the cohort companies with our list of all companies that have received VC funding.
 
=Current To Do=
 
#Work on the [[Crunchbase 2013 Snapshot]]
#Match cohort companies to VC-backed portfolio companies
#Refine our data to work out which cohort each cohort company was a member of, cohort start dates and locations, etc.
#Make a list of top accelerator lists (e.g., http://tech.co/top-startup-accelerators-ranked-2012-08) and check that we have those accelerators
=End of Semester Notes=
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the "Accelerators" folder under "Data", or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 "Accelerator Master Variable List" Google sheet].
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 "Accelerator Master Variable List" Google sheet]. This contains the following information in the "Cohort List (new)" sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location.
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).
=Data Collection Notes=
 
==MATCHING==
 
The files we used to match are located on the E drive. We used the matcher to match our portfolio company names from the cohort file located in E:\McNair\Projects\Accelerators.
*The files used for matching are located in E:\McNair\Projects\Accelerators\Matcher
*PortCo contains the names of the companies pulled from the cohort file
*AccCo includes both the cohort company name and the name of the accelerator itself
*In the matcher, the inputs are the PortCo names, as well as the VC data from our pull in SDC
*The outputs include the AccCo_VC data located in E:\McNair\Projects\Accelerators, which gives a lot of information on the matches, including:
:*name of the match itself
:*number of investments
:*dates that the company received its investments
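
To illustrate the shape of the inputs and outputs described above, here is a simplified stand-in that matches on normalized exact names. It is not the Matcher itself, whose matching rules are not documented on this page; the function names, the suffix list, and the sample rows are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical list of suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop corporate suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)

def match_portcos(portco_names, vc_rounds):
    """Match cohort company names against (company_name, round_date) pairs."""
    rounds_by_key = defaultdict(list)
    for company, round_date in vc_rounds:
        rounds_by_key[normalize(company)].append(round_date)
    matches = {}
    for name in portco_names:
        dates = rounds_by_key.get(normalize(name))
        if dates:
            matches[name] = {
                "number_of_investments": len(dates),
                "dates": sorted(dates),  # ISO date strings sort chronologically
            }
    return matches

# Made-up rows for illustration only.
vc_rounds = [
    ("Acme, Inc.", "2016-07-15"),
    ("ACME Inc", "2015-03-01"),
    ("Other Co", "2014-01-01"),
]
result = match_portcos(["Acme", "Missing"], vc_rounds)
```

Exact matching on normalized names will miss misspellings and name changes, so any real run should still be reviewed by hand.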
 
==SDC Pull==
 
We accessed SDC Platinum and pulled information on round-based funding that all registered companies received between 1999 and 2017.
 
The receipt is as follows:
 
Session Details
---------------
Request Hits Request Description
0 - DATABASE: Portfolio Companies (VIPC)
1 96155 Venture Related Deals: Select All Venture Related Deals
2 79572 Round Date: 1/1/1999 to 3/1/2017 (Custom) (Calendar)
3 Custom Report: VC Data (Columnar) - Save As:
E:\McNair\Projects\Accelerators\VC Data.txt
Billing Ref # : 2054025
Capture File : riceuniv.2054025
Session Name :
 
The VC data pull includes the following variables:
 
Company Name Date Company Date Company Company Company City Company Street Address, Line 1 Company Street Address, Line 2 Total Known Company Industry Sub-Group 3 Company Industry Major Group Round Company Stage Level 3 Round Amt, Round Amt,
==3 files==
Try to get '''Name, Score, Flag, Cohort URL and Address''' for all. ONLY GRAB OTHER VARIABLES IF EASY. Just leave things blank if you can't find them quickly.
'''If the score is 0, or the flag is S, I, A, or F just stop''' - don't bother downloading a cohort list, saving an HTML file, etc. If possible, do stick a very brief description of the problem in the notes field.
Notes:
*The first column must be the portfolio company name
*Grab as many columns as you can easily (and name them)
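
The stopping rule above can be written as a small check. This is a sketch that assumes the score is stored as a number and the flags as a comma-separated string; the function name should_stop is hypothetical.

```python
# Flags that mean no cohort list or HTML file should be collected.
STOP_FLAGS = {"S", "I", "A", "F"}

def should_stop(score, flags):
    """Return True when the score is 0 or any stop flag is present."""
    flag_set = {f.strip().upper() for f in flags.split(",") if f.strip()}
    return score == 0 or bool(flag_set & STOP_FLAGS)
```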
 
==Standardized format for text files==
 
Information Text file
*1 tab only after each category
*No spaces after commas for flags or industry
*For duration put only a number in weeks but do not write "weeks"
*Equity is either only a number (no percent sign) or a Y/N
 
 
Cohort Text file
*1 tab between each column
*Titles of each column on top
*Make a new category for "Cohort Number" and write either "1 2 3 4 etc."
*Matthew: 1-225 (done) Shrey: 226-550 (done)
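
Some of the formatting rules for the information text file can be checked automatically. The sketch below covers only the comma, duration, and equity rules; the function name check_info_fields and its parameter names are hypothetical, since the full column layout is not listed on this page.

```python
import re

NUMBER = r"\d+(\.\d+)?"

def check_info_fields(flags, industry, duration, equity):
    """Return a list of formatting problems for one information record."""
    problems = []
    for label, value in (("flags", flags), ("industry", industry)):
        if ", " in value:
            problems.append(f"{label}: remove spaces after commas")
    # Duration must be a bare number of weeks, without the word "weeks".
    if duration and not re.fullmatch(NUMBER, duration):
        problems.append("duration: use a number of weeks only")
    # Equity must be a bare number (no percent sign) or Y/N.
    if equity and not re.fullmatch(NUMBER + r"|[YN]", equity):
        problems.append("equity: use a number without a percent sign, or Y/N")
    return problems
```

Empty values are accepted without complaint, in line with the instruction to leave things blank when they cannot be found quickly.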
 
==Link to Crunchbase API application==
 
https://about.crunchbase.com/forms/research-access-apply/ (Does not work anymore)
 
https://data.crunchbase.com/v3/docs/using-the-api (Has new instructions for application)
==Sign-Ups==
*Peter - 121-140 (done)
*Ramee - 141-160 (done)
*Will - 161-180 (done through 167)
*Matthew - 181-200 (done)
*Julia - 201-220 (done)
*Julia - 362-380 (done)
*Dylan - 381-393 (done)
*Jake - 394-404 (done)
*Dylan - 405-410 (done)
*Avesh - 411-415 (done)
*Dylan - 416-423 (done)
*Peter - 424-460 (done)
*Peter - 481-490 (done)
*Julia - 491-510 (done)
*Peter - 511-515 (done)
*Julia - 516-529 (done)
*Ben - 530-540 (done)
*Shrey - 541-551 (done)
#TEB Incubation & Acceleration Center
#THRIVE Accelerator III
#THRIVE Open Innovation(DUPLICATE)
#TIM#WCAP Accelerator
#TLabs
#Telluride Venture Accelerator
#TenX
#The Alchemist Accelerator(DUPLICATE)
#The Ark
#The Bakery
#VC FinTech Accelerator
#Velocity Indiana Accelerator
#Velocity Venture CatalystPartners
#Venture Hive
#Venture I
#eMerging Ventures
#ezone
#iStart Jax(DUPLICATE)
#iStart Valley
#iVentures10
=Sources=
Summary: These are sources obtained from [[List of Accelerators]], Crunchbase, and other Google searches. We will evaluate these sources by looking at the number of accelerators they supply (as most of them are lists) and then at the type of information they provide about each accelerator. Key data points are cohort-related data, startup-related data, and logistics of the accelerator. Better sources supply more information than the URL alone.
(Obtained from [[List of Accelerators]] and various Google searches)
*http://www.represent.la/
*http://www.launch.co/blog/complete-list-of-incubators-and-accelerators-like-y-combinat.html
*https://angel.co/accelerator-4 (Does not work - seems to be replaced by https://angel.co/companies?company_types[]=Incubator )
(Obtained from Google search: "Accelerator Database")
*Type in a specific state + "accelerator" + "list" (e.g. Texas accelerator list) to search for more relevant lists
:*Once again, looked at roughly the first 20 results
*Crunchbase has its own webpage with instructions for how we retrieve the data
=Source Evaluations=
Summary: These evaluations couple with each of the sources above. The evaluations provide instructions for obtaining the information listed, as well as a general review of how useful the data seems. The review serves to determine whether a crawler would be suitable for obtaining information from the source autonomously.
 
==SOURCE: Crunchbase==
*All of the Crunchbase documentation is located on the [[Crunchbase 2013 Snapshot]] page, along with the documentation for how we determined the accelerator information.
==Source: http://www.acceleratorinfo.com/see-all.html==
==Source: http://www.seed-db.com/accelerators/all==
#Copied "Seed Accelerators" table to TextPad, data sorted itself into lines. Returned 235 results.
#Clicking on the accelerator name itself links to a page with all of its associated startups, up until 6/2016 cohort
*Overall very extensive data for the accelerators that are included on the list, but cross-referencing with other sources shows that seed-db is lacking many newer accelerators; the list is not all-inclusive.
*Includes regional distributions for accelerator groups as well. For example, rather than just "Techstars", the group is broken into Austin, Berlin, Boston, Boulder, etc.
 
==Source: http://www.seed-db.com/accelerators==
*Examples of single accelerators found
:#TMCx: http://www.tmc.edu/innovation/innovation-programs/tmcx/
:#RED labs: http://redlabs.uh.edu/8
:#SURGE accelerator: https://kirkcoburn.com/
:#OwlSpark: http://owlspark.com/
:#NextHIT: http://www.houstonhealthventures.com/nexthit-accelerator-program-application/
 
===Los Angeles Accelerators===
:#Amplify: http://amplify.la/
E:\McNair\Projects\Accelerators\Google_SiteSearch
This folder contains code for a Google search parser. The script sitesearch.py will search for a queried company and return a likely web address for that company.
 
==Way Back Machine Parser==
E:\McNair\Projects\Accelerators\Code+Final_Data\wayback_machine.py
This script takes URLs and returns a timestamp for the oldest documented webpage under that URL courtesy of the Way Back Machine Archive.
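
For readers without access to wayback_machine.py, the same idea can be sketched against the Internet Archive's CDX endpoint. This is not the project's script; it assumes the CDX server returns captures oldest first, and the sample response used to exercise the parser is made up.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(url):
    """Build a CDX request asking for one capture timestamp of `url`."""
    params = {"url": url, "output": "json", "fl": "timestamp", "limit": 1}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_earliest(payload):
    """Extract the first capture time from a CDX JSON response body."""
    rows = json.loads(payload) if payload.strip() else []
    # With output=json the first row is a header; data rows follow it.
    if len(rows) < 2:
        return None
    return datetime.strptime(rows[1][0], "%Y%m%d%H%M%S")

def earliest_capture(url):
    """Network call; returns None when the archive has no capture."""
    with urlopen(cdx_query_url(url), timeout=30) as response:
        return parse_earliest(response.read().decode("utf-8"))

# Made-up response body, used here only to show the parsing step.
example = parse_earliest('[["timestamp"], ["20050301123456"]]')
```

Keeping the parsing separate from the network call means the timestamp handling can be checked without contacting the archive.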
 
==Process Locations==
E:\McNair\Projects\Accelerators\Code+Final_Data\process_locations.py
This script takes a physical address and converts it into latitude and longitude coordinates. Should be used in conjunction with the Enclosing Circle program to find the concentration of accelerators.
E:\McNair\Software\CodeBase\EnclosingCircle.py
 
=Kauffman Foundation Incubator Proposal Information=
 
==Institutions==
Summary: F6S, Crunchbase, seed-db
 
Tools: Matcher - used to match lists of potential accelerators with our current list to identify duplicates/new matches (E:\McNair\Projects\Accelerators)
 
===F6S===
F6S WebCrawler and F6S Parser - E:\McNair\Projects\Accelerators\F6S Accelerator HTMLs
 
===CrunchBase===
 
CrunchBase 2013 Snapshot '''(All Organizations)'''- E:\McNair\Projects\Accelerators\organizations.xls
 
CrunchBase 2013 Snapshot '''(Potential Accelerators)'''- E:\McNair\Projects\Accelerators\organizations.accdb under "Potential Accelerators query"
 
*Obtained using keyword matches in the descriptions of the potential accelerators.
 
CrunchBase 2013 Snapshot '''(New Verified Accelerators)''' - E:\McNair\Projects\Accelerators\New CrunchBase Accelerators.xls
 
We have the Crunchbase 2013 Snapshot, which provided a lot of new data on accelerators and incubators, but we would like to use the Crunchbase API to get a current database snapshot that we could use to cross-reference companies and add newly formed accelerators and incubators.
 
===AngelList===
 
===seed-db===
 
Obtained through www.seed-db.com/accelerators
 
===Global Accelerator Network (GAN)===
 
GAN Parser- E:\McNair\Projects\Accelerators\Web Scraping for Accelerators\scrapeaccel.py
 
GAN Data- E:\McNair\Projects\Accelerators\Web Scraping for Accelerators\GAN Accelerator Data
*Contains: Company Name, # of Companies Range, % of Companies Funded, Funding Raised by Companies, Employee Range, Exit Funding, Exit Date, Total Company Funding Raised, # of Mentors Range, % Equity, Location, Minimum Seed Capital Investment
 
==Cohorts==
 
*Cohorts obtained manually
*All Cohort txt files are saved under E:\McNair\Projects\Accelerators\Data
*cohort file name = (accelerator name).cohort
*Most updated Accelerator cohort data: E:\McNair\Projects\Accelerators\Cleaned Cohort Data.xls
 
Automation for obtaining cohorts??
 
==Other Information==
Summary: Whois Parser, Geocode, Tools to determine industry, etc
 
===Whois Parser===
 
*Retrieves and parses Whois information. Specifically, takes a file with a column of domain names and populates the corresponding columns with information from the WhoIs API.
 
*Often used to obtain locations.
 
===Geocode===
 
Input: Company Address
Output: Directional Coordinates
 
*Used to obtain the locations of different Accelerators and Cohort companies.
 
===SDC Platinum Pull===
 
Used to obtain funding information and match companies that have gotten funding with companies that are Accelerator cohorts.
 
===Desired Information/Variables===
 
*Key People (founders, lead entrepreneurs, strategists, etc.)
*Total number of launched companies
*A FAQ for application details, accelerator vision, and
*Funds raised per company (average)
*Features offered by accelerator (perks, space, tools, etc)
 
==Desired Tools/Information==
 
===Automating the Process of Obtaining Cohorts===
*Automating this process would save a lot of time and really progress the project.
 
===Obtaining More Details on Accelerators===
 
*Having the kind of thorough information on industry, companies, funding, location, exits, mentors, and leadership that we got for the GAN companies would be fantastic.
 
===List of Alive/Dead Accelerators===
 
This is a dream but would be very helpful
