Changes

2,184 bytes added, 13:44, 10 March 2020
|Has project status=Active
|Is dependent on=Crunchbase Database, INBIA, Google Crawler
|Does subsume=Incubator Seed Data Coverage,
}}
 
Requirement: Determine at least 4 primary data sources, or secure licenses to extract ‘seed data’ from these sources, as measured by program records.
*[[AngelList Database|AngelList]]
*[[Google Crawler]]
*[[Yi Ma]]'s work assembling [[US Incubators]], state-by-state, for this project
*ClusterMapping
*Wharton entrepreneurship club
*456 in CrunchbaseIncubators.txt, see [[Crunchbase_Database#Incubators_in_Crunchbase]]
*415 in INBIA_data.txt, see [[INBIA#Retrieve_Data_from_URLs_Generated]]
*1474 (self-declared as incubators but actually many different things) 771 in angelList_companyInfo-selfdeclared.txt, see [[AngelList_Database#Parsing_Saved_AngelList_Pages]].

Note that the AngelList data also has angelList_employees.txt and angelList_portfolio.txt as associated files, and that a broader file of candidate incubators, angelList_companyInfo.txt, is also available. For self-declaration, we required that an organization call itself an incubator in either its headline or its category, and not call itself an accelerator, VC, or event. We also excluded virtual incubators and those doing social entrepreneurship. See the Excel spreadsheet for restrictions.

AngelList locations were processed into city and state in a separate file. Non-US records were then excluded, reducing the count to 733.

The load and processing script is '''Incubators.sql''' in E:\projects\Kauffman Incubator Project\. It produces the table '''CIAIncubators''' and the text file '''CIAIncubators.txt''', which contains 1603 records with the following fields and coverage:
*orgname --1603
*statecode --1600
*url --1584
*description --1188
*city --1591
*address --769
*zip --415

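The per-field coverage counts above are simply the number of records with a non-empty value in each column. As a rough illustration only (the project itself does this in '''Incubators.sql'''), the following Python sketch computes the same kind of tally. It assumes a tab-delimited file with a header row, which the page does not confirm, and <code>field_coverage</code> is a hypothetical helper name.

```python
import csv
import io


def field_coverage(tsv_text, delimiter="\t"):
    """Return (record_count, {field: non_empty_count}) for delimited text.

    Assumption: the first line is a header row naming the fields.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter=delimiter)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            # DictReader stores surplus cells under the key None and
            # fills missing cells with None; neither counts as coverage.
            if name is None or value is None:
                continue
            if value.strip():
                counts[name] += 1
    return total, counts


# Tiny made-up sample, not project data.
sample = (
    "orgname\tstatecode\turl\n"
    "Alpha Labs\tTX\thttp://alpha.example\n"
    "Beta Hub\t\t\n"
)
total, coverage = field_coverage(sample)
for field, n in coverage.items():
    print(f"*{field} --{n}")
```

To run it against a real export, read the file contents into a string first (for example with <code>open(path, encoding="utf-8").read()</code>) and pass the result to <code>field_coverage</code>.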
We also have three sources that have a mix of types, which are not yet loaded into this data:
*361 (with some non-incubators) in Gaebler.txt
*292 (very mixed type) in ClusterMapping.txt
*21 (very mixed type) in Wharton.txt

The CIA data is then combined with [[US Incubators]] data, which is separately available in '''USIncubators.txt''', and everything is matched using name-based matching to try to remove duplicates (within states) and produce the best information. The result can then be matched back to Crunchbase. There were 2155 distinct orgnames, 37 of which had internal name matches.

 perl Matcher.pl -mode=2 -file1="DistinctIncubatorOrgNames.txt" -file2="DistinctIncubatorOrgNames.txt"

The result is the table '''Incubators''' and text file '''Incubators.txt''' with 2137 records and the following coverage:
*orgnamestd --2137
*orgname --2137
*statecode --2137
*url --2031
*description --1447
*city --1955
*address --970
*zip --624
*source --2137

The URL field was then processed using the cleanurl function to create WHOIS parsable domains. A new table called IncubatorWCount was created combining the information in Incubators with the counts of distinct domains. This was then processed by hand in Excel. The resulting clean file was re-imported as IncubatorsProcessed, and restricted to keep=1 in IncubatorsClean. The result has 1999 records with the following coverage:
*statecode --1999
*url --1872
*description --1389
*city --1854
*address --909
*zip --578
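The URL cleaning and domain counting step can be pictured with a short Python sketch. This is not the project's cleanurl function, whose implementation is not shown on this page; <code>clean_url_to_domain</code> and <code>domain_counts</code> are hypothetical names, and the rules used here (lower-case the host, drop the port, path, and a leading "www.") are assumptions. It also does not reduce subdomains to a registrable domain, which a WHOIS lookup may need.

```python
from collections import Counter
from urllib.parse import urlsplit


def clean_url_to_domain(url):
    """Reduce a URL to a bare lower-case host name, or None if unusable."""
    url = (url or "").strip()
    if not url:
        return None
    # urlsplit only recognises the host when a "//" prefix is present,
    # so add one for scheme-less values such as "example.org/about".
    if "://" not in url:
        url = "//" + url
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Raised for malformed bracketed (IPv6-style) hosts.
        return None
    if not host:
        return None
    host = host.rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return host or None


def domain_counts(urls):
    """Count how many input URLs map to each cleaned domain."""
    cleaned = (clean_url_to_domain(u) for u in urls)
    return Counter(d for d in cleaned if d)


# Made-up example values, not project data.
urls = [
    "http://www.Example.org/about",
    "example.org/incubator",
    "https://hub.example.net:8443/",
    "",
]
counts = domain_counts(urls)
for domain, n in sorted(counts.items()):
    print(domain, n)
```

A domain that appears more than once in such a tally is a natural candidate for the kind of manual review described above, since several records sharing one domain may be duplicates or sub-programs of the same organization.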
