US Incubators



Project
US Incubators
Project Information
Has title US Incubators
Has owner Yi Ma
Has start date
Has deadline date
Has project status Active
Copyright © 2019 edegan.com. All Rights Reserved.


Objective

The objective of this project is to assemble a near-population dataset of U.S. incubators. This project uses the Incubator Seed Data.

File Location

E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\YiMaResearch\US Incubators

Notes:

  • Highlighted rows need to be deleted
  • The zip code field is formatted as text
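
A minimal sketch of why the zip code field is kept as text: read as a number, a zip code loses its leading zero. The sample row and the column names below are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical one-row extract; column names are illustrative only.
SAMPLE = "company_name\tzip_code\nExample Incubator\t02139\n"


def read_rows(text):
    """Read tab-delimited text, keeping every field as a string."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_rows(SAMPLE)
zip_as_text = rows[0]["zip_code"]         # text keeps the leading zero
zip_as_number = int(rows[0]["zip_code"])  # a number drops the leading zero
```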

Progress

Extract incubator data from national resources

National Data

Source | Progress | How many? | Data | Method
Whartoneclub Incubators | Done | 21 | url, company name, city, state | regular expression
InterNational Business Incubation Association (or see our INBIA page) | Done | 415 | Company name, address, city, state, zip code, country, url and contact person | regular expression
Clustermapping | Done | 292 | Company name, description, address 1, address 2, city, state, zip code | regular expression
The MBA Is Dead | Link doesn't work | 186 results | City and country; low equity, high offer, high value; high equity, low offer, low value | regular expression
Gaebler | Done | 360 results | incubator name, url | regular expression

  • The Gaebler incubator list is in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\Gaebler\Results.txt. The script that retrieves the results is in the same directory and is called Gaebler.py.
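
The "regular expression" method used for most sources can be sketched as follows. This is not the project's Gaebler.py; it is an illustrative example that assumes the incubator names and URLs appear as ordinary HTML links. The sample HTML, the pattern `LINK_RE`, and the function `extract_links` are all hypothetical.

```python
import re

# Hypothetical HTML fragment; the real source pages are laid out differently.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/alpha">Alpha Incubator</a></li>
  <li><a class="x" href="http://example.org/beta">Beta  Business
      Incubator</a></li>
</ul>
"""

# Capture the href value and the link text of each anchor tag.
LINK_RE = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (incubator name, url) pairs found in the HTML."""
    results = []
    for url, label in LINK_RE.findall(html):
        # Collapse line breaks and repeated spaces inside the link text.
        name = re.sub(r"\s+", " ", label).strip()
        results.append((name, url))
    return results


links = extract_links(SAMPLE_HTML)
```

Each pair can then be written out as one tab-delimited row. Regular expressions are fragile on irregular HTML, so the extracted rows still need a manual check.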

Extract incubator data from regional resources

Source | Progress | How many? | Region | Data | Method
Alabama Business Incubation Network | Done | 12 | Alabama | Incubator name, URL, and brief description | regular expression
IdeaGist - Alaska | Done | 1 | Alaska | Company name, URL, city, state | manual collection
Florida Business Incubation Association | Done | 72 | Florida | Incubator name, address, city, state, phone number and url | regular expression
Louisiana Business Incubation Association | Done | 25 | Louisiana | Incubator name, contact name, address and phone number, link to website | regular expression
Maryland Business Incubation Association | Done | 35 | Maryland | Incubator name, short description, and link to another page within the main site that contains a link to the incubator home page | regular expression
Massachusetts Association of Business Incubators | Done | 21 | Massachusetts | Incubator name, short description, and link to incubator home page | regular expression
Boston Startup Guide | Done | 10 | Boston | Company name and URL; capital provided and equity taken; application process | regular expression
Michigan Business Innovation Association | Done | 15 | Michigan | Company name, url, address, city, state, zip code | regular expression
NH Tech Alliance | Done | 10 | New Hampshire | Company name, city, url, brief description | regular expression
NC Business Incubation Association | Done | 33 | North Carolina | Incubator name, address, contact, title, phone number, url and email | manual collection
Oklahoma Business Incubator Association | Done | 34 | Oklahoma | Incubator name and link to it | regular expression
Incubators/Accelerators In DC | Done | 55* | DC | Incubator name, link to it, and brief description | regular expression
High Tech News and Information for South California | Done | 34 | California | Url, company name, description, city, state | regular expression
Legal Counsel to Entrepreneurs and Emerging Growth Companies | Done | 25 | Los Angeles | Url, company name, city, state, description | regular expression
IdeaGist - Colorado | Done | 8 | Colorado | Company name, url, location | manual collection
IdeaGist - Connecticut | Done | 7 | Connecticut | Company name, url, location | manual collection
Delaware Business Times | Done | 11 | Delaware | URL, company name, address, city, state code, phone number, email, description | regular expression
Washington State Department of Commerce | Done | 25 | WA | Url, company name, address, city, state, zip code | manual collection
Seattle Incubators | Done | 10 | Seattle | Company name, url, description | regular expression
Digital NYC | Done | 25 | NYC | Company name, description | regular expression
Idaho Commerce | Done | 14 | Idaho | URL, company name, city | regular expression
Business Oregon | Done | 25 | Oregon | Company name, address, city, state, zip code, service area, description | regular expression
Tech.co | Done | 16 | Arizona | URL, company name, description | regular expression
Arkansas Inc | Done | 3 | Arkansas | URL, company name, description | regular expression

Notes:

  • DC includes both incubators and accelerators
  • Oregon includes both incubators and accelerators
  • Arizona includes both incubators and accelerators
  • Clustermapping contains non-US data, which has been highlighted in the spreadsheet

Retrieving Incubators from Crunchbase Database

We pull the relevant fields from the Crunchbase database using the incubator UUIDs chosen by Yi and Libby, following this process:

1) Create a file of uuids of incubators

  • CrunchbaseShortOrgDescChosenByYi.txt (275)
  • CrunchbaseShortOrgDescChosenByLibby.txt (301)
File path: Z:\crunchbase3

2) Load the file into the database

-- UUIDs chosen by Yi
DROP TABLE IF EXISTS ChosenShortOrgUUIDs;
CREATE TABLE ChosenShortOrgUUIDs (
 uuid varchar(100)
);
\COPY ChosenShortOrgUUIDs FROM 'CrunchbaseShortOrgDescChosenByYi.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--275
-- UUIDs chosen by Libby (loaded into the second table, not the first)
DROP TABLE IF EXISTS ChosenLongOrgUUIDs;
CREATE TABLE ChosenLongOrgUUIDs (
 uuid varchar(100)
);
\COPY ChosenLongOrgUUIDs FROM 'CrunchbaseShortOrgDescChosenByLibby.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--301

3) Run a query that joins uuids with related fields

Fields we are interested in:

company_name, domain, homepage_url, country_code, state_code, region, city, address, status, short_description, category_list, category_group_list, funding_rounds, funding_total_usd, founded_on, employee_count, A.uuid, primary_role, type
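
The join in step 3 can be sketched as below. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the PostgreSQL database, the table name `organizations` and every row of data are assumptions made up for the example, and only a few of the fields listed above are selected.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with made-up rows (all data is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ChosenShortOrgUUIDs (uuid varchar(100));
        -- 'organizations' is an assumed name for the Crunchbase table.
        CREATE TABLE organizations (
            uuid varchar(100), company_name text, homepage_url text,
            city text, state_code text
        );
        INSERT INTO ChosenShortOrgUUIDs VALUES ('uuid-1'), ('uuid-2');
        INSERT INTO organizations VALUES
            ('uuid-1', 'Beta Labs', 'http://example.org', 'Austin', 'TX'),
            ('uuid-2', 'Alpha Incubator', 'http://example.com', 'Houston', 'TX'),
            ('uuid-3', 'Not Chosen Inc', 'http://example.net', 'Dallas', 'TX');
        """
    )
    return conn


def select_chosen_orgs(conn):
    """Join the chosen UUIDs (alias A) to their organization fields (alias B)."""
    query = """
        SELECT A.uuid, B.company_name, B.homepage_url, B.city, B.state_code
        FROM ChosenShortOrgUUIDs AS A
        JOIN organizations AS B ON A.uuid = B.uuid
        ORDER BY B.company_name
    """
    return conn.execute(query).fetchall()


rows = select_chosen_orgs(build_demo_database())
```

An inner join keeps only organizations whose uuid appears in the chosen list, so the row count of the result can be compared against the counts noted in step 2.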

4) Resulting files are in:

Z:\crunchbase3

File names: ChosenLongOrgResults.txt, ChosenShortOrgResults.txt

Useful Regular Expressions

1. Replace "\s+$" with [leave blank] to remove all empty lines

2. Replace "\s+$" with [leave blank] to remove trailing whitespace

3. <.*> finds everything that starts with < and ends with > (the match is greedy; use <.*?> to match one tag at a time)

4. Replace href=" with "\n" to start a new line for each url

5. Replace "\s\s+" with [leave blank] to remove runs of more than one whitespace character

6. Replace "(?-s)^(.+)\R(.+)\R(.+)\R(.+)\R(.+)\R(.+)\R" with "\1\2\3\4\5\6\r\n" to merge every six lines

7. Replace "[ ]{2,}" with [leave blank] to remove runs of more than one space between two words

8. Ctrl+Q, B turns on block select mode

9. Replace " .*" with [leave blank] to remove non-characters (this deletes everything from the first space to the end of each line)
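
Some of the replacements above can be sketched in Python for scripted cleanup. This is an illustrative translation, not the project's workflow: the function names are hypothetical, runs of spaces are collapsed to a single space rather than deleted so that words are not joined, and because Python's re module has no \R, line merging is done with list slicing instead of a regular expression.

```python
import re


def strip_trailing_whitespace(text):
    """Remove spaces and tabs at the end of every line."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


def remove_empty_lines(text):
    """Delete lines that are empty or contain only whitespace."""
    return re.sub(r"^\s*\n", "", text, flags=re.MULTILINE)


def collapse_spaces(text):
    """Collapse runs of two or more spaces into one space."""
    return re.sub(r"[ ]{2,}", " ", text)


def one_url_per_line(text):
    """Start a new line at every href=" so each url begins a line."""
    return text.replace('href="', "\n")


def merge_lines(text, n=6, sep=""):
    """Join every n lines into one; a shorter final group is joined as well."""
    lines = text.splitlines()
    groups = (lines[i:i + n] for i in range(0, len(lines), n))
    return "\n".join(sep.join(group) for group in groups)
```

Passing `sep="\t"` to `merge_lines` keeps the merged fields tab-delimited, which matches the load format used elsewhere on this page.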

Useful PostgreSQL Script

Loading/Unloading Data

Always load/unload data using the PostgreSQL-specific \COPY command below. Always load tab-delimited, UTF-8 encoded data with PC or UNIX line endings and a header row. NEVER DEVIATE FROM THIS (unless there is a VERY good reason, such as the source data being huge and coming preformatted differently).

Load using: \COPY tablename FROM 'filename.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV

Unload (copy to txt file) using: \COPY tablename TO 'filename.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
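
A minimal sketch of producing a file in the format described above (tab-delimited, UTF-8, header row, empty string for NULL) from Python, so that it can be loaded with the \COPY command. The function name `write_tab_delimited` and the sample values in the usage note are hypothetical.

```python
import csv


def write_tab_delimited(path, header, rows):
    """Write a UTF-8, tab-delimited file with a header row and UNIX line endings."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for row in rows:
            # Missing values become empty strings, matching NULL AS ''.
            writer.writerow(["" if value is None else value for value in row])
```

For example, `write_tab_delimited("filename.txt", ["uuid", "company_name"], [("abc-123", "Example Incubator")])` writes a two-line file with a header row followed by one data row.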

Creating Tables

DROP TABLE tablename;
 
CREATE TABLE tablename (
 field1 varchar(100),
 field2 int,
 field3 date,
 field4 real
);