946 bytes added, 13:41, 21 September 2020
{{Project
|Has project output=Data
|Has sponsor=Kauffman Incubator Project
|Has title=Incubator Seed Data
|Has owner=Anne Freeman,
|Has project status=Active
|Is dependent on=Crunchbase Database, INBIA, Google Crawler
|Does subsume=Incubator Seed Data Coverage,
}}
 
Requirement: Determine at least 4 primary data sources, or secure licenses to extract ‘seed data’ from these sources, as measured by program records.
*[[AngelList Database|AngelList]]
*[[Google Crawler]]
*[[Yi Ma]]'s work assembling [[US Incubators]], state-by-state, for this project
*ClusterMapping
*Wharton entrepreneurship club
The CIA data is then combined with the [[US Incubators]] data, which is separately available in '''USIncubators.txt'''. The combined records are then deduplicated within each state using name-based matching, keeping the best available information for each incubator. The result can then be matched back to Crunchbase. There were 2155 distinct orgnames, 37 of which had internal name matches. The file of distinct names was matched against itself to find them:
 perl Matcher.pl -mode=2 -file1="DistinctIncubatorOrgNames.txt" -file2="DistinctIncubatorOrgNames.txt"
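The idea of finding internal name matches within a state can be illustrated with a minimal Python sketch. This is not the logic of Matcher.pl, whose matching rules are not described on this page. The function names (normalize_name, internal_matches), the normalization rules, and the sample records are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

def normalize_name(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace.
    These rules are illustrative assumptions, not Matcher.pl's rules."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return re.sub(r"\s+", " ", name).strip()

def internal_matches(records):
    """Group (orgname, statecode) pairs whose normalized names collide
    within the same state; return only groups with more than one name."""
    groups = defaultdict(list)
    for orgname, statecode in records:
        groups[(normalize_name(orgname), statecode)].append(orgname)
    return {key: names for key, names in groups.items() if len(names) > 1}

# Made-up sample records (not taken from the project data).
records = [
    ("Tech Hub, Inc.", "TX"),
    ("Tech Hub Inc", "TX"),
    ("Tech Hub Inc", "CA"),   # same name, different state: not a duplicate here
    ("River City Labs", "MO"),
]
dupes = internal_matches(records)
for (std_name, state), names in dupes.items():
    print(state, std_name, names)
```

Keying on both the standardized name and the state code keeps identically named incubators in different states apart, which mirrors the "within states" restriction described above.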
 
The result is the table '''Incubators''' and text file '''Incubators.txt''' with 2137 records and the following coverage:
*orgnamestd --2137
*orgname --2137
*statecode --2137
*url --2031
*description --1447
*city --1955
*address --970
*zip --624
*source --2137
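Coverage counts like those above can be reproduced by counting the non-empty values in each field. The sketch below assumes a tab-delimited file with a header row; the actual layout of '''Incubators.txt''' is not stated on this page. The helper name field_coverage and the sample rows are hypothetical.

```python
import csv
import io

def field_coverage(rows, fields):
    """Count, for each field, the rows that have a non-empty value."""
    return {f: sum(1 for r in rows if (r.get(f) or "").strip()) for f in fields}

# Made-up sample standing in for a tab-delimited export with a header row.
sample = (
    "orgname\tstatecode\turl\tzip\n"
    "Alpha Labs\tTX\thttp://alpha.example\t77002\n"
    "Beta Works\tCA\t\t\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
coverage = field_coverage(rows, ["orgname", "statecode", "url", "zip"])
for field, count in coverage.items():
    print(f"*{field} --{count}")
```

To run this against a real file, the io.StringIO object would be replaced by an opened file handle, assuming the file matches the layout described above.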
 
The URL field was then processed using the cleanurl function to create WHOIS-parsable domains. A new table, IncubatorWCount, was created by combining the information in Incubators with the counts of distinct domains, and was then processed by hand in Excel. The cleaned file was re-imported as IncubatorsProcessed and restricted to records with keep=1 to produce IncubatorsClean. The result has 1999 records with the following coverage:
*statecode --1999
*url --1872
*description --1389
*city --1854
*address --909
*zip --578
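The URL cleaning, domain counting, and keep=1 restriction can be sketched roughly as follows. The behavior of the project's cleanurl function is not documented on this page, so clean_url_sketch is a hypothetical stand-in that simply reduces a URL to a lowercase hostname without a leading "www.". Reading "counts of distinct domains" as the number of records sharing each domain is also an assumption, as are the field names domain and domaincount.

```python
from collections import Counter
from urllib.parse import urlsplit

def clean_url_sketch(url):
    """Hypothetical stand-in for cleanurl: return a bare lowercase hostname,
    or None when the URL is empty or has no recognizable host."""
    url = (url or "").strip()
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to populate hostname
    host = urlsplit(url).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host

# Made-up records; keep stands for the flag set during the manual Excel pass.
records = [
    {"orgname": "Alpha Labs", "url": "http://www.alpha.example/about", "keep": 1},
    {"orgname": "Alpha Labs Inc", "url": "alpha.example/contact", "keep": 0},
    {"orgname": "Beta Works", "url": "http://Beta.example", "keep": 1},
    {"orgname": "Gamma Hub", "url": "", "keep": 1},
]

domains = [clean_url_sketch(r["url"]) for r in records]
counts = Counter(d for d in domains if d)

# Analogue of IncubatorWCount: each record plus its domain and that domain's count.
with_count = [
    dict(r, domain=d, domaincount=counts.get(d, 0))
    for r, d in zip(records, domains)
]

# Analogue of IncubatorsClean: only the records marked keep=1.
clean = [r for r in with_count if r["keep"] == 1]
```

A domain count above 1 flags records that point at the same site, which is the kind of candidate duplicate a manual review would resolve by setting keep. This sketch does not reduce subdomains to a registrable domain, which a real WHOIS lookup would probably need.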
