Difference between revisions of "Incubator Seed Data"

From edegan.com
 
(70 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
{{Project
 
{{Project
 +
|Has project output=Data
 +
|Has sponsor=Kauffman Incubator Project
 +
|Has sponsor=Kauffman Incubator Project
 
|Has title=Incubator Seed Data
 
|Has title=Incubator Seed Data
 
|Has owner=Anne Freeman,
 
|Has owner=Anne Freeman,
 
|Has project status=Active
 
|Has project status=Active
|Is dependent on=Crunchbase Database,
+
|Is dependent on=Crunchbase Database, INBIA, Google Crawler
 +
|Does subsume=Incubator Seed Data Coverage,
 
}}
 
}}
=Goal=
+
Requirement: Determine at least 4 primary data sources, or secure licenses to extract ‘seed data’ from these sources, as measured by program records.
 +
 
 +
Status: We have identified at least 4 primary data sources. [[Crunchbase Database|Crunchbase]] is our biggest structured source for incubators, and we have a license for Crunchbase Pro. Our other two structured sources are [[AngelList Database|AngelList]] and [[INBIA]]. Given the paucity of strong sources, we decided to use a custom [[Google Crawler]] as a source. We will also be creating a new [[VentureXpert Database]] using data drawn from SDC Platinum, so that we have a source of information on venture capital backed startup firms.
 +
 
 +
==Goal==
 +
 
 
We will evaluate data sources based on the number of incubators they cover and the type of information they supply on these incubators. We will also record whether these data sources collect information on any other types of entrepreneurship organizations. Ideally, these data sources would provide some or all of the variables that were identified as most important for identifying incubators ([[Formulate_baseline_attributes]]). However, it is unlikely that one data source will contain all of the baseline attributes identified. A data source that provides links to a large number of incubators, or in-depth descriptions of them, could therefore still be viable.
 
We will evaluate data sources based on the number of incubators they cover and the type of information they supply on these incubators. We will also record whether these data sources collect information on any other types of entrepreneurship organizations. Ideally, these data sources would provide some or all of the variables that were identified as most important for identifying incubators ([[Formulate_baseline_attributes]]). However, it is unlikely that one data source will contain all of the baseline attributes identified. A data source that provides links to a large number of incubators, or in-depth descriptions of them, could therefore still be viable.
 +
 +
==Chosen Sources==
 +
 +
Our primary incubator data sources are:
 +
*[[Crunchbase Database|Crunchbase]]
 +
*[[INBIA]]
 +
*[[AngelList Database|AngelList]]
 +
*[[Google Crawler]]
 +
*[[Yi Ma]]'s work assembling [[US Incubators]], state-by-state, for this project
 +
*ClusterMapping
 +
*Wharton entrepreneurship club
 +
*Gaebler
 +
 +
The [[Google Crawler]] was added because, with the exceptions of [[Crunchbase Database|Crunchbase]] and AngelList, the structured sources are all small. Its coverage is superb.
 +
 +
In addition, we will be using the sources listed below, and [[VentureXpert Database]] as a primary reference seed source (to see whether client companies received venture capital).
 +
 +
==Evaluation of Main Sources==
  
  
=Evaluation of Sources from Specific Google Searches=
 
*Searches included:
 
:* "incubator database"
 
:* "us business incubators database"
 
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
 
! Source
 
! Source
 
! Directions
 
! Directions
! Data on how many?
+
! How many?
 
! Data
 
! Data
 
! Benefits
 
! Benefits
Line 33: Line 56:
 
| May not be able to get specific information from home page. Limited list of incubators. Some organizations listed may not fall under our definition of an incubator (eg. Y Combinator)
 
| May not be able to get specific information from home page. Limited list of incubators. Some organizations listed may not fall under our definition of an incubator (eg. Y Combinator)
 
|-
 
|-
| [http://exchange.inbia.org/network/findacompany/ National Business Incubation Association]
+
| [http://exchange.inbia.org/network/findacompany/ InterNational Business Incubation Association] or see our [[INBIA]] page
 
|  
 
|  
 
* Opened source link
 
* Opened source link
Line 45: Line 68:
 
Out of the first ten links there were: 4 incubators, 2 educational programs, 1 broken link, 3 other economic development programs
 
Out of the first ten links there were: 4 incubators, 2 educational programs, 1 broken link, 3 other economic development programs
 
|-
 
|-
 +
| [https://www.clustermapping.us/organization-type/innovation-and-entrepreneurship-support-organizations Clustermapping]
 +
| Opened Source Link
 +
| 292
 +
|
 +
* Company name with link to a separate page within cluster mapping
 +
* on that page there is a link to the incubator's website
 +
| Provides a long list of entrepreneurship organizations
 +
| Data is often missing from the company's separate page, including the URL to the company's website. The description is often not detailed enough to determine the category for the economic organization without going to the company's website. Different types of entrepreneurship organizations are mixed together.
 +
Using the first 10 links, three were accelerators, six were missing links (two were self-proclaimed incubators in description), and one was another type of support organization.
 +
|-
 +
| [https://thembaisdead.com/list-of-startup-accelerators-and-incubators/ The MBA Is Dead ]
 +
|
 +
* Opened source link.
 +
* Selected "Region" >> "US & Canada"
 +
| 186 Results
 +
|
 +
* Click on each accelerator/incubator to get data
 +
* City and Country
 +
* low equity, high offer, high value
 +
* high equity, low offer, low value
 +
* link to company homepage
 +
* categories of companies it accelerates/incubates
 +
| Can search by region or by category of companies
 +
| Seems to be a lot of data on accelerators and fewer incubators included
 +
 +
Out of the first 10 unique company links -- 1 was a broken link, 7 were accelerators, and 2 could possibly be incubators
 +
|-
 +
| [https://angel.co/?ref=nav AngelList]
 
|
 
|
 +
*Opened source link.
 +
*Typed "incubator" in the search box
 +
*Clicked on "Search for 'incubator'"
 +
| 1,444 Results
 
|
 
|
 +
*Click on each incubator to get data
 +
*City and Categories
 +
*Number of employees and URL
 +
| Can use key word "incubator" to filter data
 +
| Contains some hybrid of incubator and accelerator
 +
|-
 +
| [http://www.gaebler.com/Business-Incubator-Lists-By-State.htm Gaebler]
 
|
 
|
|
+
*Opened source link.
|
+
*Browsed a list of incubators by state
|
+
| 360 Results
 +
| URL, incubator name
 +
| Well-organized list of incubators by state. Data is in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\Gaebler\Results.txt and the script to retrieve the results is in the same directory and is called Gaebler.py.
 +
| It only provides URL and incubator name; contains bad links
 
|}
 
|}
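
Several of the sources above, such as Gaebler, provide only an incubator name and a URL. The following is a minimal sketch of how such name and link pairs could be pulled from a saved HTML page. It is not the project's Gaebler.py, whose contents are not shown here; the class name <code>LinkCollector</code>, the sample HTML, and the rule of keeping only absolute <code>http</code> links are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs for absolute links in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []    # accumulated (name, url) tuples
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            name = " ".join("".join(self._text).split())
            # Assumption: only absolute links point at incubator home pages.
            if name and self._href.startswith("http"):
                self.links.append((name, self._href))
            self._href = None


# Hypothetical page fragment, not taken from any of the sources above.
SAMPLE_HTML = (
    '<ul><li><a href="http://one.example">Incubator One</a></li>'
    '<li><a href="/about">About</a></li>'
    '<li><a href="https://two.example"> Incubator\n Two </a></li></ul>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
```

Site-relative links such as <code>/about</code> are dropped in this sketch, so a source that links to internal detail pages first (as INBIA and Clustermapping do) would need a second pass over those pages.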
  
 +
These main sources were found with Google Searches that included keywords like "incubator database", "us business incubators database", and others.
  
 +
== [[Accelerator_Seed_List_(Data)#Sources | Accelerator Data Sources]] that are Potentially Viable ==
 +
{| class="wikitable"
 +
|-
 +
! Source
 +
! Directions
 +
! How many?
 +
! Data
 +
! Benefits
 +
! Limitations
 +
|-
 +
| [http://www.acceleratorinfo.com/see-all.html Accelerator Info]
 +
|
 +
* First column is “All Startup Support Programs”  (215)
 +
* Second column is “All University Programs” (249)
 +
| 464 in total
 +
| Each link on parent list leads to individual '''home page url''' of organization (that's all it is)
 +
|
| Mixed information on incubators and accelerators. Some of the university supported programs may not be considered either an incubator or an accelerator and the data has non-trivial classification problems!
  
==Source: https://www.clustermapping.us/organization-type/innovation-and-entrepreneurship-support-organizations ==
+
Out of the first 10 links, 3 bad links, 3 potential incubators, and 4 accelerators
# Opened source link
+
|-
# Received 292 results for Innovation and Entrepreneurship Support Organizations in the US
+
| [https://www.galidata.org/accelerators/directory/?keyword=&region=north_america Galidata]
# Data
+
| Filter by Region: North America
:# Brief Description
+
| 584 "accelerators"
:# Company name with link to a separate page within cluster mapping
+
|
::# Link to Company Website
+
* Company Name
::# Regions
+
* Link to homepage
'''Review'''
+
* Location
* Provides a long list of entrepreneurship organizations
+
* Short Description (often blank)
* Limitation: Often data is missing off of the separate page for the company, including the URL to the company's website. The description is often not detailed enough to determine the category for the economic organization without going to the company's website.
+
* Region
* Limitation: Different types of entrepreneurship organizations are mixed together
+
* URL
* Using the first 10 links, three were accelerators, six were missing links (two were self-proclaimed incubators in description), and one was another type of support organization.
+
| Reliable links directly to homepage of companies, can search "U.S. and Canada" or other regions.
 +
| Mix of incubators and accelerators. Needs a custom crawler. Description field is too unreliable for classification.
 +
Out of the first 10 organizations in the US -- 6 were accelerators and 4 could potentially be incubators.
 +
|-
 +
|  [[:Crunchbase Database]]
 +
| See the [[Crunchbase Database]] project page for more information.
 +
|
 +
|
 +
|
 +
|
 +
|-
 +
| [https://www.s-b-z.com/FORMING%20THE%20BUSINESS/db/accelerators.aspx S-B-Z]
 +
| Open and copy and paste into excel then clean up
 +
| 143
 +
| Contains Name, URL, Description, Industry, Type, City, State
 +
| In E:\projects\Kauffman Incubator Project, as excel and txt
 +
| Mostly accelerators but contains a classifiable description field.
 +
|}
  
 +
Outside of the U.S., there is the UK Business Incubators and Accelerators Directory, which is saved in E:\projects\Kauffman Incubator Project\Business-incubators-accelerators-directory-update.xlsx. It has 216 incubator records (including 'Incubator (University Enterprise Zone)') with a record for each location of an incubator.
  
 +
==Region Specific Incubator Sources==
  
=Evaluation of Sources from INIBIA List of US Accelerator Associations=
+
'''See [[US Incubators]], which extends the notes in this section with data collection.'''
The National Business Incubation Association maintains a list of [https://inbia.org/services/resources/ U.S. Incubation Associations]. We went through this list and evaluated each association as a potential data source. These sites generally contain a list of incubators that are working in collaboration with the NBIA and are within that specific state. The sites could be useful in cross-referencing data pulled from the NBIA main database as some of the incubators listed on the state specific websites are not in the main NBIA database.
 
  
==Source: [http://asbdc.org/start-ups/incubators-in-alabama/ Alabama Business Incubation Network]==
+
Many state and local governments contain information on incubators and accelerators that operate within their jurisdiction. They do not provide comprehensive sources on all incubators within the US but could be helpful as sources to cross-reference with a larger database.
# Opened source link
 
# Counted incubators listed on the home page and found information for 12 incubators
 
# Data
 
:* Incubator Name, Brief Description, and a link to the home page
 
'''Review'''
 
* Provides reliable links to incubators within Alabama
 
* Benefit: data is filtered to include only incubators
 
* Limitation: only incubators within Alabama
 
==Source: http://www.fbiaonline.org/Incubators/incubators.htm Florida Business Incubation Association (FBIA)==
 
# Opened source link
 
# Opened links for each of the four regions in Florida
 
# Counted incubators listed in each of the four regions to get information on 66 incubators
 
# Data
 
:* main site contains links to four regions in Florida, each region contains the following data:
 
::* incubator name, address, phone number and link to home page
 
'''Review'''
 
* Provides reliable links to incubators within Florida
 
* Benefit: data is filtered to include only incubators
 
* Limitation: may be challenging for a web crawler to navigate, as main page has links to regions which has links to home pages of incubators
 
* Limitation: only incubators within Florida
 
==Source: https://www.louisianaincubation.org/current-members Louisiana Business Incubation Association==
 
# Opened source link
 
# Copied the data on incubators into a text editor and searched for how many times the word "E-Mail" appeared
 
# Site contained data on 28 incubators in the state of Louisiana
 
# Data
 
:* Main site contains
 
:* incubator name
 
:* contact name
 
:* address and phone number
 
:* link to website
 
'''Review'''
 
* Provides reliable links to incubators within Louisiana
 
* Benefit: data is filtered to include only incubators
 
* Limitation: may be challenging for a web crawler to navigate, as main page does not contain information on incubator but rather has links to home pages of incubators
 
* Limitation: only incubators within Louisiana
 
==Source: http://incubatemaryland.org/incubators/ Maryland Business Incubation Association==
 
# Opened source link
 
# Counted number of incubators listed on the page and found information on 35 incubators
 
# Data
 
:* Main site contains incubator name, short description and link to another page within main site read more which contains:
 
::* link to incubate home page
 
'''Review'''
 
* Provides reliable links to incubators within Maryland
 
* Benefit: data is filtered to include only incubators
 
* Limitation: may be challenging for a web crawler to navigate, as main page contains links internal to the site which then link to the home pages of incubators
 
* Limitation: only incubators within Maryland
 
==Source: https://www.massincubators.org/ Massachusetts Association of Business Incubators==
 
# Opened source link
 
# Counted number of incubators listed on the page and found information on 20 incubators
 
# Data
 
:* Main site contains incubator name, short description and link to the incubator's home page
 
'''Review'''
 
* Provides reliable links to incubators within Massachusetts
 
* Benefit: data is filtered to include only incubators
 
* Limitation: only incubators within Massachusetts
 
  
 +
The National Business Incubation Association maintains a list of [https://inbia.org/services/resources/ U.S. Incubation Associations]. We went through this list and evaluated each association as a potential data source. These sites generally contain a list of incubators that are working in collaboration with the NBIA and are within that specific state. The sites could be useful in cross-referencing data pulled from the NBIA main database as some of the incubators listed on the state specific websites are not in the main NBIA database. They could also be helpful in cross-referencing data pulled from other main databases as these sites have reliable links, are filtered to include only incubators, and have a relatively consistent format.
 +
{| class="wikitable"
 +
|-
 +
! Source
 +
! Directions
 +
! How many?
 +
! Region
 +
! Data
 +
! Benefits
 +
! Limitations
  
 +
|-
 +
| [http://asbdc.org/start-ups/incubators-in-alabama/ Alabama Business Incubation Network]
 +
| Opened source link and counted incubators listed on the home page
 +
| 12
 +
| Alabama
 +
| Incubator Name, Brief Description, and a link to the home page
 +
| Reliable links that are filtered to include only incubators
 +
| only contains information on incubators in Alabama that are associated with NBIA
 +
|-
 +
| [http://www.fbiaonline.org/Incubators/incubators.htm Florida Business Incubation Association]
 +
| Opened source link and then opened links for each of the four regions in Florida
 +
| 66
 +
| Florida
 +
| source link contains 4 links to the regions in Florida, each region contains incubator name, address, and a link to the home page
 +
| Provides reliable links. Filtered to include only information on incubators
 +
| May be challenging for a web crawler to navigate because it is separated by region. Only provides information about Florida incubators.
 +
|-
 +
| [https://www.louisianaincubation.org/current-members Louisiana Business Incubation Association]
 +
| Opened source link. Copied the data on incubators into a text editor and searched for how many times the word "E-Mail" appeared
 +
| 28
 +
| Louisiana
 +
|
 +
* incubator name
 +
* contact name
 +
* address and phone number
 +
* link to website
 +
| data is filtered to include only incubators, links are reliable
 +
| only incubators in state of Louisiana, limited data set
 +
|-
 +
| [http://incubatemaryland.org/incubators/ Maryland Business Incubation Association]
 +
| Opened source link and counted number of incubators listed on the page
 +
| 35
 +
| Maryland
 +
| Main site contains incubator name, short description, and link to another page within main site which contains a link to the incubator home page
 +
| Reliable links, filtered to include only incubators
 +
| It would be challenging for a web crawler to navigate, as the main page contains links internal to the site which then link to the home pages of incubators. Limited dataset with only incubators in Maryland.
 +
|-
 +
| [https://www.massincubators.org/ Massachusetts Association of Business Incubators]
 +
| Open source link and count number of incubators listed on the page
 +
| 20
 +
| Massachusetts
 +
| incubator name, short description, and link to incubator home page
 +
| reliable links, only data on incubators
 +
| limited dataset
 +
|-
 +
|[https://bostonstartupsguide.com/guide/every-boston-startup-accelerator-incubator/ Boston Startup Guide]
 +
| Scrolled down to the section labeled "Startup incubators in Boston"
 +
| 10
 +
| Boston
 +
|
 +
*Company Name and URL
 +
* Capital Provided & equity taken
 +
* Application Process
 +
| reliable links
 +
| relatively unformatted data that would be challenging to use. Limited in scope
 +
|-
 +
| [https://www.viethconsulting.com/members/googlemaps/google_maps.php?mode=normal&orgcode=MBIA Michigan Business Innovation Association]
 +
| Open source link and count number of incubators listed in the column next to the map
 +
| 15
 +
| Michigan
 +
| incubator name, address, link to location on map, and link to incubator home page
 +
| reliable links, only data on incubators
 +
| limited dataset
 +
|-
 +
| [https://livefreeandstart.com/resources/incubators-makerspaces/ NH Tech Alliance ]
 +
| Open source link and count organizations listed under "NHBIN Member Locations"
 +
| 8
 +
| New Hampshire
 +
| incubator name, town within NH, brief description, and link to home page
 +
| reliable links, only data on incubators
 +
| limited dataset, not very structured organization on website
 +
|-
 +
| [http://www.ncincubation.org/NCIncubators.aspx NC Business Incubation Association]
 +
| Open source link, click on each county and count the number of business incubators
 +
| 32
 +
| North Carolina
 +
| Incubator name, address, program directors, and link
 +
| only data on incubators
 +
| limited dataset, hard to navigate site with web crawler, some of the incubators do not have links
 +
|-
 +
| [https://www.okbia.org/our-members Oklahoma Business Incubator Association]
 +
| Open source link and count the number of incubators
 +
| 29
 +
| Oklahoma
 +
| Incubator name and link to it
 +
| reliable links, only data on incubators
 +
| limited dataset
 +
|-
 +
| [https://dmped.dc.gov/page/incubators-accelerators-and-co-working-spaces Incubators/Accelerators In DC]
 +
| Open source link and count the number of incubators; co-working spaces were not included
 +
| 15
 +
| DC
 +
| Incubator name and link to it and brief description
 +
| reliable links, helpful description
 +
| limited dataset, mix of incubators and other organizations
 +
|}
  
 
+
==[[Accelerator_Seed_List_(Data)#Sources | Accelerator Data Sources]] that are not viable==
= [[Accelerator_Seed_List_(Data)#Sources | Accelerator Data Sources]] that are Potentially Viable =
 
==Source: http://www.acceleratorinfo.com/see-all.html==
 
# Opened source link
 
# Copied links from first column (“All Startup Support Programs”) into excel and returned 215 results
 
# Copied links from second column (“All University Programs”) into excel and returned 249 results)
 
# Each link on parent list leads to individual '''home page url''' of organization
 
'''Review'''
 
* Provides only links, does not separate between incubator and accelerator, some of the university supported programs may not be considered either an incubator or an accelerator
 
==Source: https://bostonstartupsguide.com/guide/every-boston-startup-accelerator-incubator/==
 
# Scrolled down to the section labeled "Startup incubators in Boston"
 
# Counted the number of incubators in Boston (10)
 
# Data
 
:# Company Name and URL
 
:# Capital Provide
 
:# Equity taken
 
:# Application Process
 
'''Review'''
 
* The data is relatively unformatted and would be a challenge to use
 
* It is limited in scope to the Boston Area and only provides information on 10 incubators
 
 
 
 
 
 
 
 
 
=[[Accelerator_Seed_List_(Data)#Sources | Accelerator Data Sources]] that are not viable=
 
 
* '''Source:''' http://www.seed-db.com/accelerators
 
* '''Source:''' http://www.seed-db.com/accelerators
 
:* ''Reason'': does not include information on incubators
 
:* ''Reason'': does not include information on incubators
Line 170: Line 298:
 
:* ''Reason'': this website is no longer active, the link will not work
 
:* ''Reason'': this website is no longer active, the link will not work
 
:*  ''Learn More'': [[Accelerator_Seed_List_(Data)#Source:_https:.2F.2Fwww.corporate-accelerators.net.2Fdatabase.2F | Previous Research]]
 
:*  ''Learn More'': [[Accelerator_Seed_List_(Data)#Source:_https:.2F.2Fwww.corporate-accelerators.net.2Fdatabase.2F | Previous Research]]
 +
 +
*'''Source:''' https://www.gan.co/engage/accelerators/
 +
:*''Reason'': does not include information on incubators
 +
:*''Learn More'': https://www.brookings.edu/research/accelerating-growth-startup-accelerator-programs-in-the-united-states/
  
 
* '''Source:''' https://github.com/florianheinemann/www-corporate-accelerators-net/blob/master/_data/Accelerators.json
 
* '''Source:''' https://github.com/florianheinemann/www-corporate-accelerators-net/blob/master/_data/Accelerators.json
 
:* ''Reason'': does not include information on incubators
 
:* ''Reason'': does not include information on incubators
 
:*  ''Learn More'': [[Accelerator_Seed_List_(Data)#Source:_https:.2F.2Fgithub.com.2Fflorianheinemann.2Fwww-corporate-accelerators-net.2Fblob.2Fmaster.2F_data.2FAccelerators.json| Previous Research]]
 
:*  ''Learn More'': [[Accelerator_Seed_List_(Data)#Source:_https:.2F.2Fgithub.com.2Fflorianheinemann.2Fwww-corporate-accelerators-net.2Fblob.2Fmaster.2F_data.2FAccelerators.json| Previous Research]]
 +
 +
==Other Sources Not Yet Explored==
 +
 +
We found the following sources in the process of other work:
 +
*https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/706581/Business-incubators-accelerators-directory-update.xlsx
 +
 +
==Assembling the data==
 +
 +
The data is assembled in the dbase '''incubators''' from the following national sources, all copied in E:\projects\Kauffman Incubator Project\Incubator Data Assembly:
 +
*456 in CrunchbaseIncubators.txt, see [[Crunchbase_Database#Incubators_in_Crunchbase]]
 +
*415 in INBIA_data.txt, see [[INBIA#Retrieve_Data_from_URLs_Generated]]
 +
*771 in angelList_companyInfo-selfdeclared.txt, see [[AngelList_Database#Parsing_Saved_AngelList_Pages]]. Note that the AngelList data also has angelList_employees.txt and angelList_portfolio.txt as associated files, and that a broader file of candidate incubators, angelList_companyInfo.txt, is also available. For self-declaration, we required that they call themselves an incubator in either their headline or category, and not call themselves an accelerator, VC, or event. We also excluded virtual incubators and those doing social entrepreneurship. See the Excel spreadsheet for restrictions. AngelList locations were processed into city and state in a separate file. Non-US records were then excluded, reducing the count to 733.
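
The self-declaration rule for AngelList can be sketched roughly as below. This is an illustration only: the actual restrictions are recorded in the Excel spreadsheet, and the keyword patterns, the function name <code>is_self_declared_incubator</code>, and the sample headlines are assumptions rather than the project's real criteria.

```python
import re

# Assumption: a record "calls itself an incubator" if the word appears
# in its headline or category.
INCLUDE = re.compile(r"\bincubators?\b", re.IGNORECASE)

# Assumption: these keywords stand in for "accelerator, VC, or event",
# plus virtual incubators and social entrepreneurship.
EXCLUDE = re.compile(
    r"\b(accelerators?|vc|events?|virtual|social entrepreneurship)\b",
    re.IGNORECASE,
)


def is_self_declared_incubator(headline: str, category: str) -> bool:
    """Return True if the headline/category text passes both rules."""
    text = f"{headline} {category}"
    return bool(INCLUDE.search(text)) and not EXCLUDE.search(text)
```

For example, a record with the headline "Biotech incubator in Houston" would pass, while "Incubator and accelerator for fintech" would be rejected because it also calls itself an accelerator.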
 +
 +
The load and processing script is '''Incubators.sql''' in E:\projects\Kauffman Incubator Project\
 +
 +
This results in table '''CIAIncubators''' and text file '''CIAIncubators.txt''', which contains 1603 records with the following fields and coverage:
 +
*orgname --1603
 +
*statecode --1600
 +
*url --1584
 +
*description --1188
 +
*city --1591
 +
*address --769
 +
*zip --415
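
Coverage counts like those above are the number of records with a non-empty value in each field. A minimal sketch of that calculation follows, assuming the text file is tab-delimited with a header row (the actual layout of CIAIncubators.txt is not documented here). The function name <code>field_coverage</code> and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def field_coverage(handle, delimiter="\t"):
    """Return (record count, {field: non-empty count}) for a delimited file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    counts = Counter({name: 0 for name in reader.fieldnames or []})
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            if name is None:
                continue  # extra columns beyond the header are ignored
            if value is not None and value.strip():
                counts[name] += 1
    return total, dict(counts)


# Made-up sample data, used only to show the calculation.
SAMPLE = (
    "orgname\tstatecode\turl\n"
    "A Inc\tTX\thttp://a.example\n"
    "B Labs\t\thttp://b.example\n"
    "C Hub\tCA\t\n"
)

total, counts = field_coverage(io.StringIO(SAMPLE))
```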
 +
 +
We also have three sources that have a mix of types, which are not yet loaded into this data:
 +
*361 (with some non-incubators) in Gaebler.txt
 +
*292 (very mixed type) in ClusterMapping.txt
 +
*21 (very mixed type) in Wharton.txt
 +
 +
The CIA data is then combined with [[US Incubators]] data, which is separately available in '''USIncubators.txt''', and everything is matched using name based matching to try to remove duplicates (within states) and produce the best information. The result can then be matched back to Crunchbase. There were 2155 distinct orgnames, 37 of which had internal name matches.
 +
perl Matcher.pl -mode=2 -file1="DistinctIncubatorOrgNames.txt" -file2="DistinctIncubatorOrgNames.txt"
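
The idea of name-based matching within states can be illustrated with the sketch below. It does not reproduce the logic of Matcher.pl, which is not shown on this page. The standardization rules (lowercasing, stripping punctuation, dropping a small set of suffix words) and the names <code>standardize</code> and <code>duplicate_groups</code> are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumption: words dropped when standardizing an organization name.
DROP_WORDS = {"inc", "llc", "corp", "co", "the"}


def standardize(name):
    """Lowercase, strip punctuation, and drop common suffix words."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in DROP_WORDS)


def duplicate_groups(records):
    """Group (orgname, statecode) pairs that share a standardized name
    within the same state; return only groups with more than one name."""
    groups = defaultdict(list)
    for orgname, statecode in records:
        groups[(statecode, standardize(orgname))].append(orgname)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Made-up records, used only to show the grouping.
RECORDS = [
    ("The Tech Hub, Inc.", "TX"),
    ("Tech Hub", "TX"),
    ("Tech Hub", "CA"),
    ("Bio Labs LLC", "MA"),
]
```

Keying on the state as well as the standardized name keeps identically named organizations in different states apart, which matches the "within states" restriction described above.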
 +
 +
The result is the table '''Incubators''' and text file '''Incubators.txt''' with 2137 records and the following coverage:
 +
*orgnamestd --2137
 +
*orgname --2137
 +
*statecode --2137
 +
*url --2031
 +
*description --1447
 +
*city --1955
 +
*address --970
 +
*zip --624
 +
*source --2137
 +
 +
The URL field was then processed using the cleanurl function to create WHOIS parsable domains. A new table called IncubatorWCount was created combining the information in Incubators with the counts of distinct domains. This was then processed by hand in Excel. The resulting clean file was re-imported as IncubatorsProcessed, and restricted to keep=1 in IncubatorsClean. The result has 1999 records with the following coverage:
 +
*statecode --1999
 +
*url --1872
 +
*description --1389
 +
*city --1854
 +
*address --909
 +
*zip --578
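
The URL cleaning and domain counting step can be sketched as follows. This is not the project's cleanurl function, whose definition is not shown here. The sketch simply takes the host name of each URL and strips a leading <code>www.</code>, which is an assumption about what "WHOIS parsable" requires; reducing a host to its registrable domain would need a public suffix list. The names <code>clean_url</code> and <code>domain_counts</code> are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def clean_url(url):
    """Reduce a URL to a lowercase host name without a leading 'www.'."""
    url = url.strip()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how many records share each cleaned domain."""
    domains = (clean_url(u) for u in urls)
    return Counter(d for d in domains if d)


# Made-up URLs, used only to show the behavior.
URLS = [
    "HTTP://WWW.Example.org/about?x=1",
    "example.org",
    "https://incubator.example.edu:8080/",
    "",
]
```

Records that share a domain are candidates for the hand review described above, since two entries pointing at the same site are often the same organization.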

Latest revision as of 12:41, 21 September 2020


Project
Incubator Seed Data
Project Information
Has title Incubator Seed Data
Has owner Anne Freeman
Has start date
Has deadline date
Has project status Active
Is dependent on Crunchbase Database, INBIA, Google Crawler
Does subsume Incubator Seed Data Coverage
Subsumed by: Ecosystem Organization Classifier
Has sponsor Kauffman Incubator Project
Has project output Data
Copyright © 2019 edegan.com. All Rights Reserved.

Requirement: Determine at least 4 primary data sources, or secure licenses to extract ‘seed data’ from these sources, as measured by program records.

Status: We have identified at least 4 primary data sources. Crunchbase is our biggest structured source for incubators, and we have a license for Crunchbase Pro. Our other two structured sources are AngelList and INBIA. Given the paucity of strong sources, we decided to use a custom Google Crawler as a source. We will also be creating a new VentureXpert Database using data drawn from SDC Platinum, so that we have a source of information on venture capital backed startup firms.

Goal

We will evaluate data sources based on the number of incubators they cover and the type of information they supply on these incubators. We will also record whether these data sources collect information on any other types of entrepreneurship organizations. Ideally, these data sources would provide some or all of the variables that were identified as most important for identifying incubators (Formulate_baseline_attributes). However, it is unlikely that one data source will contain all of the baseline attributes identified. A data source that provides links to a large number of incubators, or in-depth descriptions of them, could therefore still be viable.

Chosen Sources

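Our primary incubator data sources are:
  • Crunchbase
  • INBIA
  • AngelList
  • Google Crawler
  • Yi Ma's work assembling US Incubators, state-by-state, for this project
  • ClusterMapping
  • Wharton entrepreneurship club
  • Gaebler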

The Google Crawler was added because, with the exceptions of Crunchbase and AngelList, the structured sources are all small. Its coverage is superb.

In addition, we will be using the sources listed below, and VentureXpert Database as a primary reference seed source (to see whether client companies received venture capital).

Evaluation of Main Sources

{| class="wikitable"
|-
! Source
! Directions
! How many?
! Data
! Benefits
! Limitations
|-
| Whartoneclub Incubators
|
* Opened source link.
* Copied results from "U.S. Based Incubators" into excel spreadsheet.
| 21
|
* Name, City, State
* Url to home page of incubator
| Links to the home page of incubator
| May not be able to get specific information from home page. Limited list of incubators. Some organizations listed may not fall under our definition of an incubator (eg. Y Combinator)
|-
| InterNational Business Incubation Association or see our INBIA page
|
* Opened source link
* Entered "United States" for country and clicked "Find Companies"
| 415
|
* Company Name and address
* Link to another page within inbia; on that page there is a link to the incubator's homepage
| The database contains information on a lot of economic development institutions and would provide a mass quantity of data
| Challenging for web crawler as link connects to another page within inbia and then link on that page connects to company's homepage. Not all of institutions listed are incubators.

Out of the first ten links there were: 4 incubators, 2 educational programs, 1 broken link, 3 other economic development programs
|-
| Clustermapping
| Opened Source Link
| 292
|
* Company name with link to a separate page within cluster mapping
* on that page there is a link to the incubator's website
| Provides a long list of entrepreneurship organizations
| Data is often missing from the company's separate page, including the URL to the company's website. The description is often not detailed enough to determine the category for the economic organization without going to the company's website. Different types of entrepreneurship organizations are mixed together.

Using the first 10 links, three were accelerators, six were missing links (two were self-proclaimed incubators in description), and one was another type of support organization.
|-
| The MBA Is Dead
|
* Opened source link.
* Selected "Region" >> "US & Canada"
| 186 Results
|
* Click on each accelerator/incubator to get data
* City and Country
* low equity, high offer, high value
* high equity, low offer, low value
* link to company homepage
* categories of companies it accelerates/incubates
| Can search by region or by category of companies
| Seems to be a lot of data on accelerators and fewer incubators included

Out of the first 10 unique company links -- 1 was a broken link, 7 were accelerators, and 2 could possibly be incubators
|-
| AngelList
|
* Opened source link.
* Typed "incubator" in the search box
* Clicked on "Search for 'incubator'"
| 1,444 Results
|
* Click on each incubator to get data
* City and Categories
* Number of employees and URL
| Can use key word "incubator" to filter data
| Contains some hybrid of incubator and accelerator
|-
| Gaebler
|
* Opened source link.
* Browsed a list of incubators by state
| 360 Results
| URL, incubator name
| Well-organized list of incubators by state. Data is in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\Gaebler\Results.txt and the script to retrieve the results is in the same directory and is called Gaebler.py.
| It only provides URL and incubator name; contains bad links
|}

These main sources were found with Google Searches that included keywords like "incubator database", "us business incubators database", and others.

Accelerator Data Sources that are Potentially Viable

Source Directions How many? Data Benefits Limitations
Accelerator Info
Directions:
  • First column is “All Startup Support Programs” (215)
  • Second column is “All University Programs” (249)
How many? 464 in total
Data: Each link on the parent list leads to the individual home page URL of the organization (that's all it is).
Benefits and limitations: Mixed information on incubators and accelerators. Some of the university-supported programs may not be considered either an incubator or an accelerator, and the data has non-trivial classification problems.

Out of the first 10 links, 3 bad links, 3 potential incubators, and 4 accelerators

Galidata
Directions: Filter by Region: North America
How many? 584 "accelerators"
Data:
  • Company Name
  • Link to homepage
  • Location
  • Short Description (often blank)
  • Region
  • URL
Benefits: Reliable links directly to homepage of companies, can search "U.S. and Canada" or other regions.
Limitations: Mix of incubators and accelerators. Needs a custom crawler. Description field is too unreliable for classification.

Out of the first 10 organizations in the US -- 6 were accelerators and 4 could potentially be incubators.

Crunchbase Database See the Crunchbase Database project page for more information.
S-B-Z
Directions: Open, copy and paste into Excel, then clean up
How many? 143
Data: Contains Name, URL, Description, Industry, Type, City, State
Benefits: In E:\projects\Kauffman Incubator Project, as Excel and txt
Limitations: Mostly accelerators but contains a classifiable description field.

Outside of the U.S., there is the UK Business Incubators and Accelerators Directory, which is saved in E:\projects\Kauffman Incubator Project\Business-incubators-accelerators-directory-update.xlsx. It has 216 incubator records (including 'Incubator (University Enterprise Zone)') with a record for each location of an incubator.

Region Specific Incubator Sources

See US Incubators, which extends the notes in this section with data collection.

Many state and local governments publish information on incubators and accelerators that operate within their jurisdiction. These are not comprehensive sources on all incubators within the US, but they could be helpful as sources to cross-reference with a larger database.

The National Business Incubation Association maintains a list of U.S. Incubation Associations. We went through this list and evaluated each association as a potential data source. These sites generally contain a list of incubators that are working in collaboration with the NBIA and are within that specific state. The sites could be useful in cross-referencing data pulled from the NBIA main database as some of the incubators listed on the state specific websites are not in the main NBIA database. They could also be helpful in cross-referencing data pulled from other main databases as these sites have reliable links, are filtered to include only incubators, and have a relatively consistent format.

Source Directions How many? Region Data Benefits Limitations
Alabama Business Incubation Network
Directions: Opened source link and counted incubators listed on the home page.
How many? 12
Region: Alabama
Data: Incubator name, brief description, and a link to the home page.
Benefits: Reliable links that are filtered to include only incubators.
Limitations: Only contains information on incubators in Alabama that are associated with NBIA.

Florida Business Incubation Association
Directions: Opened source link and then opened links for each of the four regions in Florida.
How many? 66
Region: Florida
Data: Source link contains 4 links to the regions in Florida; each region contains incubator name, address, and a link to the home page.
Benefits: Provides reliable links. Filtered to include only information on incubators.
Limitations: May be challenging for a web crawler to navigate because it is separated by region. Only provides information about Florida incubators.

Louisiana Business Incubation Association
Directions: Opened source link. Copied the data on incubators into a text editor and searched for how many times the word "E-Mail" appeared.
How many? 28
Region: Louisiana
Data:
  • incubator name
  • contact name
  • address and phone number
  • link to website
Benefits: Data is filtered to include only incubators; links are reliable.
Limitations: Only incubators in the state of Louisiana; limited dataset.

Maryland Business Incubation Association
Directions: Opened source link and counted the number of incubators listed on the page.
How many? 35
Region: Maryland
Data: Main site contains incubator name, short description, and a link to another page within the main site, which contains a link to the incubator home page.
Benefits: Reliable links, filtered to include only incubators.
Limitations: It would be challenging for a web crawler to navigate, as the main page contains links internal to the site which then link to the home pages of incubators. Limited dataset with only incubators in Maryland.

Massachusetts Association of Business Incubators
Directions: Opened source link and counted the number of incubators listed on the page.
How many? 20
Region: Massachusetts
Data: Incubator name, short description, and link to incubator home page.
Benefits: Reliable links, only data on incubators.
Limitations: Limited dataset.

Boston Startup Guide
Directions: Scrolled down to the section labeled "Startup incubators in Boston".
How many? 10
Region: Boston
Data:
  • Company Name and URL
  • Capital Provided & equity taken
  • Application Process
Benefits: Reliable links.
Limitations: Relatively unformatted data that would be challenging to use. Limited in scope.

Michigan Business Innovation Association
Directions: Opened source link and counted the number of incubators listed in the column next to the map.
How many? 15
Region: Michigan
Data: Incubator name, address, link to location on map, and link to incubator home page.
Benefits: Reliable links, only data on incubators.
Limitations: Limited dataset.

NH Tech Alliance
Directions: Opened source link and counted organizations listed under "NHBIN Member Locations".
How many? 8
Region: New Hampshire
Data: Incubator name, town within NH, brief description, and link to home page.
Benefits: Reliable links, only data on incubators.
Limitations: Limited dataset; the website is not very structured.

NC Business Incubation Association
Directions: Opened source link, clicked on each county, and counted the number of business incubators.
How many? 32
Region: North Carolina
Data: Incubator name, address, program directors, and link.
Benefits: Only data on incubators.
Limitations: Limited dataset; site is hard to navigate with a web crawler; some of the incubators do not have links.

Oklahoma Business Incubator Association
Directions: Opened source link and counted the number of incubators.
How many? 29
Region: Oklahoma
Data: Incubator name and link to it.
Benefits: Reliable links, only data on incubators.
Limitations: Limited dataset.

Incubators/Accelerators In DC
Directions: Opened source link and counted the number of incubators, not including co-working spaces.
How many? 15
Region: DC
Data: Incubator name, link to it, and brief description.
Benefits: Reliable links, helpful description.
Limitations: Limited dataset; mix of incubators and other organizations.

Accelerator Data Sources that are not viable

  • Reason: data is cluttered/messy, does not provide links to incubator websites and doesn't include enough information for evaluation without incubator url
  • Learn More: Previous Research
  • Reason: this website is no longer active, the link will not work
  • Learn More: Previous Research

Other Sources Not Yet Explored

We found the following sources in the process of other work:

Assembling the data

The data is assembled in the dbase incubators from the following national sources, all copied in E:\projects\Kauffman Incubator Project\Incubator Data Assembly:

  • 456 in CrunchbaseIncubators.txt, see Crunchbase_Database#Incubators_in_Crunchbase
  • 415 in INBIA_data.txt, see INBIA#Retrieve_Data_from_URLs_Generated
  • 771 in angelList_companyInfo-selfdeclared.txt, see AngelList_Database#Parsing_Saved_AngelList_Pages. Note that the AngelList data also has angelList_employees.txt and angelList_portfolio.txt as associated files, and that a broader file of candidate incubators, angelList_companyInfo.txt, is also available. For self-declaration, we required that they call themselves an incubator in either their headline or category, and that they not call themselves an accelerator, VC, or event. We also excluded virtual incubators and those doing social entrepreneurship. See the Excel spreadsheet for restrictions. AngelList locations were processed into city and state in a separate file. Non-US records were then excluded, reducing the count to 733.
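
The self-declaration rule for AngelList can be sketched roughly as follows. This is only an illustration under assumptions: the actual restrictions live in the Excel spreadsheet, and the function name, the two input fields, and the exact keyword patterns here are hypothetical.

```python
import re

# Hypothetical keyword patterns; the real restrictions are in the Excel spreadsheet.
INCUBATOR = re.compile(r"\bincubators?\b", re.IGNORECASE)
OTHER_TYPE = re.compile(r"\b(accelerators?|vcs?|events?)\b", re.IGNORECASE)
EXCLUDED = re.compile(
    r"\bvirtual incubators?\b|\bsocial entrepreneurship\b", re.IGNORECASE
)


def is_self_declared_incubator(headline: str, category: str) -> bool:
    """Return True if the headline or category says 'incubator' and
    nothing marks the record as another type or an excluded kind."""
    text = f"{headline} {category}"
    if not INCUBATOR.search(text):
        return False  # never calls itself an incubator
    if OTHER_TYPE.search(text):
        return False  # also calls itself an accelerator, VC, or event
    if EXCLUDED.search(text):
        return False  # virtual incubator or social entrepreneurship
    return True


if __name__ == "__main__":
    print(is_self_declared_incubator("Hardware incubator in Austin", "Incubators"))
    print(is_self_declared_incubator("Incubator and accelerator", ""))
```

Keyword rules like these are deliberately conservative: a hybrid incubator/accelerator is dropped rather than kept, which matches the intent of keeping only self-declared incubators.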

The load and processing script is Incubators.sql in E:\projects\Kauffman Incubator Project\

This results in table CIAIncubators and text file CIAIncubators.txt, which contains 1603 records with the following fields and coverage:

  • orgname --1603
  • statecode --1600
  • url --1584
  • description --1188
  • city --1591
  • address --769
  • zip --415
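
The coverage figures above are counts of records with a non-blank value in each field. A minimal sketch of that tally follows, assuming the text file is tab-delimited with a header row (the file layout is not stated on this page, and the function name field_coverage is hypothetical).

```python
import csv
import io


def field_coverage(lines, delimiter="\t"):
    """Return (record count, {field: number of records with a non-blank value})."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            # Skip overflow columns (name is None) and short rows (value is None).
            if name is not None and value is not None and value.strip():
                counts[name] += 1
    return total, counts


if __name__ == "__main__":
    # Made-up sample rows, not project data.
    sample = io.StringIO(
        "orgname\tstatecode\turl\tzip\n"
        "Alpha Incubator\tTX\thttp://alpha.example\t77005\n"
        "Beta Labs\tCA\t\t\n"
        "Gamma Hub\t\thttp://gamma.example\t\n"
    )
    total, coverage = field_coverage(sample)
    for field, n in coverage.items():
        print(f"{field} --{n} of {total}")
```

To run it on a real file, pass an open file handle instead of the in-memory sample.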

We also have three sources that have a mix of types, which are not yet loaded into this data:

  • 361 (with some non-incubators) in Gaebler.txt
  • 292 (very mixed type) in ClusterMapping.txt
  • 21 (very mixed type) in Wharton.txt

The CIA data is then combined with US Incubators data, which is separately available in USIncubators.txt, and everything is matched using name based matching to try to remove duplicates (within states) and produce the best information. The result can then be matched back to Crunchbase. There were 2155 distinct orgnames, 37 of which had internal name matches.

perl Matcher.pl -mode=2 -file1="DistinctIncubatorOrgNames.txt" -file2="DistinctIncubatorOrgNames.txt"

The result is the table Incubators and text file Incubators.txt with 2137 records and the following coverage:

  • orgnamestd --2137
  • orgname --2137
  • statecode --2137
  • url --2031
  • description --1447
  • city --1955
  • address --970
  • zip --624
  • source --2137
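
The within-state, name-based de-duplication used to build Incubators can be illustrated with a simplified sketch. It is not the Matcher.pl algorithm, which is not described on this page: it assumes a plain normalization (lowercase, strip punctuation, drop a few assumed stop words) and keeps the first non-blank value seen for each field. All function names and the stop-word list are hypothetical.

```python
import re

# Assumed stop words dropped during standardization; not taken from Matcher.pl.
STOP_WORDS = {"inc", "llc", "corp", "co", "the"}


def standardize(name):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def dedupe_within_state(records):
    """Merge records sharing (statecode, standardized name),
    filling blank fields from later duplicates."""
    merged = {}
    for rec in records:
        key = (rec.get("statecode", ""), standardize(rec["orgname"]))
        if key not in merged:
            merged[key] = dict(rec, orgnamestd=key[1])
            continue
        kept = merged[key]
        for field, value in rec.items():
            if value and not kept.get(field):
                kept[field] = value  # best available information wins
    return list(merged.values())


if __name__ == "__main__":
    # Made-up records, not project data.
    rows = [
        {"orgname": "Alpha Incubator, Inc.", "statecode": "TX", "url": "", "city": "Houston"},
        {"orgname": "alpha incubator", "statecode": "TX", "url": "http://alpha.example", "city": ""},
        {"orgname": "Alpha Incubator", "statecode": "CA", "url": "", "city": ""},
    ]
    for row in dedupe_within_state(rows):
        print(row)
```

Keying on the state code as well as the standardized name keeps same-named organizations in different states as separate records.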

The URL field was then processed using the cleanurl function to create WHOIS parsable domains. A new table called IncubatorWCount was created combining the information in Incubators with the counts of distinct domains. This was then processed by hand in Excel. The resulting clean file was re-imported as IncubatorsProcessed, and restricted to keep=1 in IncubatorsClean. The result has 1999 records with the following coverage:

  • statecode --1999
  • url --1872
  • description --1389
  • city --1854
  • address --909
  • zip --578
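
The URL cleaning and domain counting step can be sketched as below. The project's cleanurl function is not shown on this page, so this is an assumed stand-in: it extracts the host with the standard library and strips a leading "www.". It does not reduce subdomains to a registrable domain, which a real WHOIS lookup would also need. The names clean_domain and domain_counts are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def clean_domain(url):
    """Return a lowercase host without a leading 'www.', or '' if none is found."""
    url = url.strip()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host when a scheme is present
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        return ""  # malformed URL
    if host.startswith("www."):
        host = host[4:]
    return host


def domain_counts(urls):
    """Count how many URLs map to each cleaned domain, ignoring blanks."""
    return Counter(d for d in map(clean_domain, urls) if d)


if __name__ == "__main__":
    # Made-up URLs, not project data.
    urls = [
        "http://www.alpha.example/apply",
        "https://alpha.example",
        "beta.example/about",
        "",
    ]
    for domain, n in domain_counts(urls).items():
        print(domain, n)
```

Domains shared by more than one record are natural candidates for the hand review in Excel, since they may be duplicates or several programs run by one host organization.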