
3,633 bytes added , 13:46, 21 September 2020


{{Project

|Has project output=Data,Tool,How-to

|Has sponsor=Kauffman Incubator Project

|Has title=Crunchbase Database

|Has owner=Hiep Nguyen

tar -xvf csv_export.tar.gz_user_key=[API key goes here].

The current CSV files in the Crunchbase data are:

data\acquisitions.csv

data\category_groups.csv

data\degrees.csv

data\events.csv

data\event_appearances.csv

data\funding_rounds.csv

data\funds.csv

data\investments.csv

data\investment_partners.csv

data\investors.csv

data\ipos.csv

data\jobs.csv

data\organizations.csv

data\organization_descriptions.csv

data\org_parents.csv

data\people.csv

data\people_descriptions.csv

To keep track of the data types in each CSV file that is copied into the SQL database, a script, get_type.py, is included in E:\projects\crunchbase3. This Python script loads every CSV file in the current directory into a data frame and prints its first 5 rows.
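get_type.py itself is not reproduced on this page. As a rough sketch of what such a preview could look like, assuming only the Python standard library (the original may well use pandas) and hypothetical helper names (guess_type, preview_csv, preview_directory), the following prints the header and first five rows of each CSV file and makes a crude guess at each column's type. Dates are reported as text, so date columns still need to be identified by eye.

```python
import csv
from pathlib import Path


def guess_type(values):
    """Crude guess: 'integer', 'numeric', or 'text' (empty values are ignored)."""
    seen = [v for v in values if v != ""]
    if not seen:
        return "unknown"
    for label, cast in (("integer", int), ("numeric", float)):
        try:
            for v in seen:
                cast(v)
            return label
        except ValueError:
            continue
    return "text"


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows rows, {column: guessed type}) for one file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    types = {
        name: guess_type([row[i] for row in rows if i < len(row)])
        for i, name in enumerate(header)
    }
    return header, rows, types


def preview_directory(directory="."):
    """Print a preview of every *.csv file in the given directory."""
    for path in sorted(Path(directory).glob("*.csv")):
        header, rows, types = preview_csv(path)
        print(f"== {path.name} ==")
        print(header)
        for row in rows:
            print(row)
        print(types)
```

Calling preview_directory() from the folder that holds the extracted CSV files prints one block per file. The guess is based on the previewed rows only, so it is a starting point for choosing column types rather than a guarantee.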

All of the crunchbase3 data from drive E is now also on drive Z, in Z:/crunchbase3.

Since the data changes a lot compared to previous years, running \i load_crunchbase.sql in one go might not be very useful, and one may need to load one table at a time by pasting the relevant part of the SQL script into the terminal.

All 17 datasets from the API have been copied to the PostgreSQL server on drive Z under /bulk/crunchbase3. To make the date-time format in Postgres work properly, every quoted empty string ("") in the CSV files has been stripped, leaving an empty unquoted field that PostgreSQL's CSV mode reads as NULL. This was done with the command:

sed 's/""//g' file.csv >file_clean.csv

The script that I used for this is clean_data.sh in E:/projects/crunchbase3. A shorter script that does this for every file in the directory is possible but may not be necessary, as not all files require this edit.
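One thing to watch with the sed command is that it removes every pair of double quotes, including the "" that CSV uses to escape a literal quote inside a quoted field. A minimal sketch of a directory-wide alternative is given here, assuming standard CSV quoting and UTF-8 files. The function names and the _clean suffix (mirroring file_clean.csv above) are hypothetical. It rewrites each file through Python's csv module so that only fields that are entirely empty lose their quotes.

```python
import csv
import io
from pathlib import Path


def blank_quoted_empties(text):
    """Re-emit CSV text so that empty fields are written unquoted.

    csv.reader decodes both "" and an empty unquoted field to '', and
    csv.writer with QUOTE_MINIMAL writes '' back without quotes, so escaped
    quotes inside non-empty fields survive the round trip. (A row made of a
    single empty field is the one case the writer still quotes.)
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


def clean_directory(directory=".", suffix="_clean"):
    """Write <name>_clean.csv next to every <name>.csv in the directory."""
    written = []
    for path in sorted(Path(directory).glob("*.csv")):
        if path.stem.endswith(suffix):
            continue  # skip output left over from an earlier run
        target = path.with_name(path.stem + suffix + ".csv")
        with open(path, newline="", encoding="utf-8") as src:
            cleaned = blank_quoted_empties(src.read())
        with open(target, "w", newline="", encoding="utf-8") as dst:
            dst.write(cleaned)
        written.append(target)
    return written
```

Running clean_directory() in the data folder leaves the original files untouched and returns the list of cleaned files, which can then be used in the \COPY commands. Fields that did not need quoting in the first place also come out unquoted, which does not affect how PostgreSQL loads them.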

==Working with the database==

All the scripts in load_crunchbase.sql have been updated. The file now works with the current data (as of 03/29/2019) pulled from the Crunchbase API, and the number of rows copied from each CSV file is recorded at the end of each \COPY command.

To see and use the data in the postgres server:

1) Connect to reseacher@199.188.177.215. A password is required (ask Prof. Egan for details).

2) Go to /bulk/crunchbase3

cd /bulk/crunchbase3

3) Connect to the database and list its tables

psql crunchbase3

\dt

4) Perform regular SQL queries

==Incubators in Crunchbase==

\COPY (SELECT uuid, company_name, short_description FROM Organizations WHERE country_code='USA' AND short_description LIKE '%incubat%') TO 'CrunchbaseShortOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV

--466

\COPY (SELECT A.uuid, A.company_name, B.description FROM Organizations AS A JOIN organization_descriptions AS B ON A.uuid=B.uuid WHERE country_code='USA' AND description LIKE '%incubat%') TO 'CrunchbaseLongOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV

--933

The two queries above were run against the Crunchbase database (see [[Ecosystem Organization Classifier]]). Their results were then manually reviewed in two xlsx files (CrunchbaseLongOrgDescsUSAIncubat_IncubatorScore and CrunchbaseShortOrgDescsUSAIncubat_IncubatorScore), stored in E:\projects\crunchbase3.

Records from these files were then combined into IncubatorsFromCrunchbase.xlsx, provided that they scored 1 in the Long file, or were marked keep and did not score 0 (social impact or virtual) in the Short file. The combined file has 564 (not necessarily unique) records and the following columns:

uuid company_name description Score Notes Source
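The selection rule can be sketched in Python as follows. The review was done by hand in Excel, so this is only an illustration of one reading of the rule, in which the Long and Short conditions are applied to their own files separately. It assumes that each reviewed row is available as a dict keyed by the column names above, that the keep mark is recorded in the Notes column, and that Source records which file a row came from. None of these details are confirmed by the page.

```python
def score_of(record):
    """Return the Score as a float, or None when it is blank or not numeric."""
    try:
        return float(record.get("Score"))
    except (TypeError, ValueError):
        return None


def keep_long(record):
    """Long-description file: keep rows that were scored 1."""
    return score_of(record) == 1


def keep_short(record):
    """Short-description file: keep rows marked 'keep' that did not score 0.

    Assumption: the 'keep' mark is written in the Notes column.
    """
    marked_keep = "keep" in str(record.get("Notes", "")).lower()
    return marked_keep and score_of(record) != 0


def combine(long_rows, short_rows):
    """Concatenate the selected rows, tagging each with its Source file.

    Duplicates are deliberately left in, matching the 'not necessarily
    unique' record count of the combined file.
    """
    combined = []
    for source, rows, keep in (("Long", long_rows, keep_long),
                               ("Short", short_rows, keep_short)):
        for row in rows:
            if keep(row):
                combined.append({**row, "Source": source})
    return combined
```

An organization that passes in both files appears twice in the result, once per source, which is why a later step has to reduce the list to distinct UUIDs.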

RetrievingIncubators.sql was then modified to load this data, locate distinct UUIDs, and output the corresponding organization records. The resulting file is CrunchbaseIncubators.txt (456 unique records, all USA), which has the following fields:

company_name uuid address city state_code region status domain category_list short_description
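RetrievingIncubators.sql is not shown here and does this work in SQL. Purely to illustrate the two steps (reduce the combined list to distinct UUIDs, then pull the matching organization records with the fields listed above), here is a small Python sketch. The function names are hypothetical, and the rows are assumed to be dicts keyed by column name.

```python
# Output fields, in the order listed for CrunchbaseIncubators.txt.
FIELDS = ["company_name", "uuid", "address", "city", "state_code", "region",
          "status", "domain", "category_list", "short_description"]


def distinct_uuids(records):
    """Return each uuid once, in order of first appearance."""
    seen = set()
    ordered = []
    for record in records:
        uuid = record["uuid"]
        if uuid not in seen:
            seen.add(uuid)
            ordered.append(uuid)
    return ordered


def organization_records(uuids, organizations):
    """Look up each uuid in the organizations rows and keep only FIELDS.

    UUIDs with no matching organization are skipped; missing fields are
    filled with an empty string.
    """
    by_uuid = {org["uuid"]: org for org in organizations}
    return [
        {field: by_uuid[uuid].get(field, "") for field in FIELDS}
        for uuid in uuids
        if uuid in by_uuid
    ]
```

In SQL the same idea is a SELECT DISTINCT on the loaded uuid column joined back to the organizations table.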

