Changes

946 bytes added , 13:41, 21 September 2020

{{Project

|Has project output=Data

|Has sponsor=Kauffman Incubator Project

|Has title=Incubator Seed Data

|Has owner=Anne Freeman,

|Has project status=Active

|Is dependent on=Crunchbase Database, INBIA, Google Crawler

|Does subsume=Incubator Seed Data Coverage,

}}

Requirement: Determine at least 4 primary data sources, or secure licenses to extract ‘seed data’ from these sources, as measured by program records.

*[[AngelList Database|AngelList]]

*[[Google Crawler]]

*[[Yi Ma]]'s work assembling [[US Incubators]], state-by-state, for this project

*ClusterMapping

*Wharton entrepreneurship club

The CIA data is then combined with [[US Incubators]] data, which is separately available in '''USIncubators.txt'''. The combined records are matched using name-based matching, within states, to remove duplicates and keep the best available information for each incubator. The result can then be matched back to Crunchbase. There were 2155 distinct orgnames, 37 of which had internal name matches.

 perl Matcher.pl -mode=2 -file1="DistinctIncubatorOrgNames.txt" -file2="DistinctIncubatorOrgNames.txt"
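The idea of matching a name list against itself, within states, to surface internal duplicates can be sketched as follows. This is a minimal illustration in Python, not the logic of Matcher.pl: the normalization rules, the suffix list, and the function names <code>normalize_name</code> and <code>group_duplicates</code> are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore; not taken from Matcher.pl.
_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_duplicates(records):
    """Group (orgname, statecode) pairs that share a state and a normalized name.

    Returns only the groups with more than one member, i.e. the candidates
    for internal name matches.
    """
    groups = defaultdict(list)
    for orgname, statecode in records:
        groups[(statecode, normalize_name(orgname))].append(orgname)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Illustrative records only; these are not rows from the project data.
records = [
    ("Tech Hub, Inc.", "TX"),
    ("Tech Hub", "TX"),
    ("Tech Hub", "CA"),
    ("Launch Pad LLC", "NY"),
]
duplicates = group_duplicates(records)
```

Keying on the state code as well as the normalized name means that two organizations with the same name in different states are not treated as duplicates, which mirrors the within-state restriction described above.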

The result is the table '''Incubators''' and text file '''Incubators.txt''' with 2137 records and the following coverage:

*orgnamestd --2137

*orgname --2137

*statecode --2137

*url --2031

*description --1447

*city --1955

*address --970

*zip --624

*source --2137
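Coverage counts like those above are counts of records with a non-empty value in each field. A minimal sketch of that calculation in Python, assuming the records have already been loaded as dictionaries keyed by field name (the loading step and the helper name <code>field_coverage</code> are assumptions, since the layout of '''Incubators.txt''' is not described here):

```python
def field_coverage(rows, fields):
    """Count, for each field, the rows that have a non-empty value."""
    return {
        field: sum(1 for row in rows if (row.get(field) or "").strip())
        for field in fields
    }


# Illustrative rows only; these are not records from the Incubators table.
rows = [
    {"orgname": "Tech Hub", "statecode": "TX", "url": "http://techhub.example", "zip": ""},
    {"orgname": "Launch Pad", "statecode": "NY", "url": "", "zip": "10001"},
    {"orgname": "Seed Works", "statecode": "CA", "url": None, "zip": "  "},
]
coverage = field_coverage(rows, ["orgname", "statecode", "url", "zip"])
```

Missing keys, <code>None</code>, and whitespace-only strings are all treated as empty in this sketch; whether the original counts treated them the same way is not stated.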

The URL field was then processed using the cleanurl function to create WHOIS-parsable domains. A new table, IncubatorWCount, was created by combining the information in Incubators with the counts of distinct domains. This table was then processed by hand in Excel. The resulting clean file was re-imported as IncubatorsProcessed and restricted to records with keep=1 to produce IncubatorsClean. The result has 1999 records with the following coverage:

*statecode --1999

*url --1872

*description --1389

*city --1854

*address --909

*zip --578
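The URL cleaning and domain counting steps described above can be approximated as follows. This is a sketch under stated assumptions, not the project's cleanurl function: <code>clean_url</code> is a hypothetical stand-in that reduces a URL to a lowercase host name without a leading <code>www.</code>, and the counting uses an in-memory list rather than the IncubatorWCount table.

```python
from collections import Counter
from urllib.parse import urlsplit


def clean_url(url):
    """Reduce a URL to a bare lowercase host name (hypothetical stand-in for cleanurl)."""
    url = (url or "").strip().lower()
    if not url:
        return ""
    if "://" not in url:
        # Without a scheme, urlsplit would treat the host as part of the path.
        url = "//" + url
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def domain_counts(urls):
    """Count how many records share each cleaned domain, skipping empty URLs."""
    domains = (clean_url(u) for u in urls)
    return Counter(d for d in domains if d)


# Illustrative URLs only; these are not values from the Incubators table.
urls = [
    "http://www.example.org/about",
    "Example.org/incubator",
    "https://labs.example.com:8080/",
    "",
]
counts = domain_counts(urls)
```

This sketch keeps subdomains such as <code>labs.example.com</code> intact. Reducing a host to its registrable domain for WHOIS lookups would need a public suffix list, which is outside what is shown here.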

Ed
