Changes

5,952 bytes added ,  13:44, 10 March 2020
|Has owner=Anne Freeman,
|Has project status=Active
|Is dependent on=Crunchbase Database, INBIA, Google Crawler|Does subsume=Incubator Seed Data Coverage,
}}
 
Requirement: Determine at least 4 primary data sources, or secure licenses to extract ‘seed data’ from these sources, as measured by program records.
Status: We have identified at least 4 primary data sources. [[Crunchbase Database|Crunchbase]] is our biggest structured source for incubators, and we have a license for Crunchbase Pro. Our other two structured sources are [[AngelList Database|AngelList]] and [[INBIA]]. Given the paucity of strong sources, we decided to use a custom [[Google Crawler]] (searching "incubator cityname" and similar) as a source. We will also be creating a new [[VentureXpert Database]] using data drawn from SDC Platinum, so that we have a source of information on venture capital backed startup firms.
==Goal==
We will evaluate data sources based on the number of incubators they have data on and the type of information they supply on these incubators. We will also record whether or not these data sources collect information on any other types of entrepreneurship organizations. Ideally these data sources would provide some or all of the variables that were identified as most important for identifying incubators ([[Formulate_baseline_attributes]]). However, it is unlikely that one data source will contain all of the baseline attributes identified. A data source that provides links to a large quantity of incubators, or in-depth descriptions, could therefore still be viable.
==Chosen Sources==
 
Our primary incubator data sources are:
*[[Crunchbase Database|Crunchbase]]
*[[INBIA]]
*[[AngelList Database|AngelList]]
*[[Google Crawler]]
*[[Yi Ma]]'s work assembling [[US Incubators]], state-by-state, for this project
*ClusterMapping
*Wharton entrepreneurship club
*Gaebler
 
The [[Google Crawler]] was added because, with the exceptions of [[Crunchbase Database|Crunchbase]] and AngelList, the structured sources are all small. Its coverage is superb.
 
In addition, we will be using the sources listed below, and [[VentureXpert Database]] as a primary reference seed source (to see whether client companies received venture capital).
 
==Evaluation of Main Sources==
==Evaluation of Sources from Specific Google Searches==
*Searches included:
:* "incubator database"
:* "us business incubators database"
{| class="wikitable"
|-
| May not be able to get specific information from home page. Limited list of incubators. Some organizations listed may not fall under our definition of an incubator (eg. Y Combinator)
|-
| [http://exchange.inbia.org/network/findacompany/ International Business Incubation Association] or see our [[INBIA]] page
|
* Opened source link
| Can search by region or by category of companies
| Seems to be a lot of data on accelerators and fewer incubators included
 
Out of the first 10 unique company links -- 1 was a broken link, 7 were accelerators, and 2 could possibly be incubators
|-
| [https://angel.co/?ref=nav AngelList]
|
*Opened source link.
*Typed "incubator" in the search box
*Clicked on "Search for 'incubator'"
| 1,444 Results
|
*Click on each incubator to get data
*City and Categories
*Number of employees and URL
| Can use key word "incubator" to filter data
| Contains some hybrid of incubator and accelerator
|-
| [http://www.gaebler.com/Business-Incubator-Lists-By-State.htm Gaebler]
|
*Opened source link.
*Browsed a list of incubators by state
| 360 Results
| URL, incubator name
| Well-organized list of incubators by state. Data is in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\Gaebler\Results.txt and the script to retrieve the results is in the same directory and is called Gaebler.py.
| It only provides URL and incubator name; contains bad links
|}
 
These main sources were found with Google Searches that included keywords like "incubator database", "us business incubators database", and others.
== [[Accelerator_Seed_List_(Data)#Sources | Accelerator Data Sources]] that are Potentially Viable ==
{| class="wikitable"
|-
| [http://www.acceleratorinfo.com/see-all.html Accelerator Info]
|
* Opened source link
* Copied first column (“All Startup Support Programs”) into Excel (215)
* Copied second column (“All University Programs”) into Excel (249)
| 464 in total
| Each link on parent list leads to individual '''home page url''' of organization
| Reliable links, includes university supported programs (that's all it is)
| Mixed information on incubators and accelerators. Some of the university supported programs may not be considered either an incubator or an accelerator, and the data has non-trivial classification problems! Out of the first 10 links, 3 were bad links, 3 were potential incubators, and 4 were accelerators
|-
| [https://www.galidata.org/accelerators/directory/?keyword=&region=north_america Galidata]
| Filter by Region: North America
| 584 "accelerators"
|
* Company Name
* Link to homepage
* Location
* Short Description (often blank)
* Region
* URL
| Reliable links directly to homepage of companies, can search within "U.S. and Canada" or other regions.
| Mix of incubators and accelerators. Can only filter region to North America. Needs a custom crawler. Description field is too unreliable for classification. Out of the first 10 organizations in the US -- 6 were accelerators and 4 could potentially be incubators.
|-
| [[:Crunchbase Database]]
|
|
|-
| [https://www.s-b-z.com/FORMING%20THE%20BUSINESS/db/accelerators.aspx S-B-Z]
| Open and copy and paste into excel then clean up
| 143
| Contains Name, URL, Description, Industry, Type, City, State
| In E:\projects\Kauffman Incubator Project, as excel and txt
| Mostly accelerators but contains a classifiable description field.
|}
Outside of the U.S., there is the UK Business Incubators and Accelerators Directory, which is saved in E:\projects\Kauffman Incubator Project\Business-incubators-accelerators-directory-update.xlsx. It has 216 incubator records (including 'Incubator (University Enterprise Zone)') with a record for each location of an incubator.
 
==Region Specific Incubator Sources==
'''See [[US Incubators]], which extends the notes in this section with data collection.'''
Many state and local governments publish information on incubators and accelerators that operate within their jurisdiction. These are not comprehensive sources on all incubators within the US, but they could be helpful as sources to cross-reference with a larger database.
| reliable links, helpful description
| limited dataset, mix of incubators and other organizations
 
|}
 
 
 
 
 
 
==[[Accelerator_Seed_List_(Data)#Sources | Accelerator Data Sources]] that are not viable==
:* ''Reason'': does not include information on incubators
:* ''Learn More'': [[Accelerator_Seed_List_(Data)#Source:_https:.2F.2Fgithub.com.2Fflorianheinemann.2Fwww-corporate-accelerators-net.2Fblob.2Fmaster.2F_data.2FAccelerators.json| Previous Research]]
 
==Other Sources Not Yet Explored==
 
We found the following sources in the process of other work:
*https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/706581/Business-incubators-accelerators-directory-update.xlsx
 
==Assembling the data==
 
The data is assembled in the dbase '''incubators''' from the following national sources, all copied in E:\projects\Kauffman Incubator Project\Incubator Data Assembly:
*456 in CrunchbaseIncubators.txt, see [[Crunchbase_Database#Incubators_in_Crunchbase]]
*415 in INBIA_data.txt, see [[INBIA#Retrieve_Data_from_URLs_Generated]]
*771 in angelList_companyInfo-selfdeclared.txt, see [[AngelList_Database#Parsing_Saved_AngelList_Pages]]. Note that the AngelList data also has angelList_employees.txt and angelList_portfolio.txt as associated files, and that a broader file of candidate incubators, angelList_companyInfo.txt, is also available. For self-declaration, we required that organizations call themselves an incubator in either their headline or category, and not call themselves an accelerator, VC, or event. We also excluded virtual incubators and those doing social entrepreneurship. See the Excel spreadsheet for restrictions. AngelList locations were processed into city and state in a separate file. Non-US records were then excluded, reducing the count to 733.
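The self-declaration rule for AngelList records can be sketched roughly as follows. This is only an illustration: the function name and keyword lists are hypothetical, and the actual restrictions are the ones recorded in the project's Excel spreadsheet.

```python
import re

# Hypothetical keyword lists, chosen to mirror the rule described above.
DECLARING_WORDS = {"incubator", "incubators"}
EXCLUDED_WORDS = {"accelerator", "accelerators", "vc", "event", "events"}
EXCLUDED_PHRASES = ("virtual incubator", "social entrepreneurship")


def is_self_declared_incubator(headline: str, category: str) -> bool:
    """True if the headline or category says 'incubator' and no excluded term appears."""
    text = f"{headline} {category}".lower()
    words = set(re.findall(r"[a-z]+", text))
    if not words & DECLARING_WORDS:
        return False  # never calls itself an incubator
    if words & EXCLUDED_WORDS:
        return False  # also calls itself an accelerator, VC, or event
    # Phrase-level exclusions (virtual incubators, social entrepreneurship).
    return not any(phrase in text for phrase in EXCLUDED_PHRASES)
```

Matching on whole words rather than substrings avoids false hits on short terms such as "vc".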
 
The load and processing script is '''Incubators.sql''' in E:\projects\Kauffman Incubator Project\
 
This results in table '''CIAIncubators''' and text file '''CIAIncubators.txt''', which contains 1603 records with the following fields and coverage:
*orgname --1603
*statecode --1600
*url --1584
*description --1188
*city --1591
*address --769
*zip --415
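Coverage counts like those above can be reproduced by counting the non-empty values per field. The sketch below assumes a tab-delimited file with a header row, which is an assumption about '''CIAIncubators.txt''' rather than something stated on this page; the function name is hypothetical.

```python
import csv


def field_coverage(lines, delimiter="\t"):
    """Count non-empty values per column for delimited text with a header row."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Skip overflow columns (key None) and missing or blank values.
            if name is not None and value is not None and value.strip():
                counts[name] += 1
    return counts
```

For a file on disk, the function can be called with an open file object, for example <code>field_coverage(open(path, newline="", encoding="utf-8"))</code>, where the encoding is also an assumption.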
 
We also have three sources that have a mix of types, which are not yet loaded into this data:
*361 (with some non-incubators) in Gaebler.txt
*292 (very mixed type) in ClusterMapping.txt
*21 (very mixed type) in Wharton.txt
 
The CIA data is then combined with [[US Incubators]] data, which is separately available in '''USIncubators.txt''', and everything is matched using name-based matching to try to remove duplicates (within states) and produce the best information. The result can then be matched back to Crunchbase. There were 2155 distinct orgnames, 37 of which had internal name matches.
 perl Matcher.pl -mode=2 -file1="DistinctIncubatorOrgNames.txt" -file2="DistinctIncubatorOrgNames.txt"
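The internals of Matcher.pl are not described on this page. As a rough illustration only of what name-based matching within states involves, the following sketch standardizes names and groups records that collide on the standardized name and state. The noise-word list and function names are hypothetical and are not taken from Matcher.pl.

```python
import re
from collections import defaultdict

# Hypothetical noise words to drop before comparing names.
NOISE_WORDS = {"the", "inc", "llc", "corp", "corporation"}


def standardize_name(name):
    """Lowercase, strip punctuation, and drop noise words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in NOISE_WORDS)


def find_duplicates(records):
    """Group (orgname, statecode) pairs sharing a standardized name within a state."""
    groups = defaultdict(list)
    for orgname, statecode in records:
        groups[(standardize_name(orgname), statecode)].append(orgname)
    # Keep only keys with more than one record, i.e. candidate duplicates.
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Keeping the state code in the grouping key means two organizations with the same name in different states are not merged.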
 
The result is the table '''Incubators''' and text file '''Incubators.txt''' with 2137 records and the following coverage:
*orgnamestd --2137
*orgname --2137
*statecode --2137
*url --2031
*description --1447
*city --1955
*address --970
*zip --624
*source --2137
 
The URL field was then processed using the cleanurl function to create WHOIS parsable domains. A new table called IncubatorWCount was created combining the information in Incubators with the counts of distinct domains. This was then processed by hand in Excel. The resulting clean file was re-imported as IncubatorsProcessed, and restricted to keep=1 in IncubatorsClean. The result has 1999 records with the following coverage:
*statecode --1999
*url --1872
*description --1389
*city --1854
*address --909
*zip --578
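The cleanurl function itself is not shown on this page. A minimal sketch of the kind of processing described (reducing a URL to a host name and counting distinct domains) might look like the following. The names <code>clean_url</code> and <code>domain_counts</code> are hypothetical, and the sketch only strips a leading "www."; it does not reduce hosts to registered domains.

```python
from collections import Counter
from urllib.parse import urlsplit


def clean_url(url):
    """Reduce a URL to a lowercase host name without a leading 'www.'."""
    url = url.strip()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to locate the host
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def domain_counts(urls):
    """Count how many records share each cleaned domain."""
    return Counter(d for d in map(clean_url, urls) if d)
```

Domains with a count above one are the records that would need attention in the manual review step.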
