Crunchbase Database

From edegan.com
Revision as of 19:41, 22 March 2019 by Hiep (talk | contribs)


Project
Crunchbase Database
Project Information
Has title Crunchbase Database
Has owner Hiep Nguyen
Has start date 2019/03/13
Has deadline date 2019/03/22
Has project status Active
Dependent(s): Ecosystem Organization Classifier, Incubator Seed Data
Copyright © 2019 edegan.com. All Rights Reserved.


Files and Dbase

Files are in:

  • E:\projects\crunchbase3
  • Z:\crunchbase3

Dbase is crunchbase3

The old project page is Crunchbase Data. File locations listed as Z:/bulk/ should now be read as Z:/bulk/mcnair/. For example, there is an old load script at /bulk/mcnair/crunchbase/crunchbaseData/load_crunchbase.sql


Crunchbase Pro

https://www.crunchbase.com/login

Login details:

  • mcnair@rice.edu getpasswordfromed

Getting and cleaning data

The URL for API calls is https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz?user_key=[API KEY GOES HERE]

The API key (premium) is stored in E:\projects\crunchbase3

The bash script that downloads and extracts the data (1.9 GB) is at E:\projects\crunchbase3\get_data.sh

Alternatively, download and extract the data directly from the Windows command prompt with the following commands. Quoting the URL and naming the output file with -o keeps the user_key query string out of the local file name.

curl -o csv_export.tar.gz "https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz?user_key=[API key goes here]"

tar -xvf csv_export.tar.gz
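
For a scripted alternative to the shell commands, the same download-and-extract step can be sketched in Python with only the standard library. This is a minimal sketch, not the contents of get_data.sh: the function names, the CRUNCHBASE_API_KEY environment variable, and the data output folder are assumptions made for illustration.

```python
import os
import shutil
import tarfile
import urllib.parse
import urllib.request
from pathlib import Path

# Export endpoint quoted on this page; the key is added as a query parameter.
EXPORT_URL = "https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz"


def build_export_url(api_key):
    """Return the export URL with the user_key query parameter appended."""
    return EXPORT_URL + "?" + urllib.parse.urlencode({"user_key": api_key})


def download_export(api_key, archive):
    """Stream the archive to a fixed local path (no query string in the name)."""
    with urllib.request.urlopen(build_export_url(api_key)) as response:
        with open(archive, "wb") as out:
            shutil.copyfileobj(response, out)
    return archive


def extract_export(archive, dest):
    """Extract the tar archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects gzip
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    # Hypothetical: the key is read from an environment variable, not hard-coded.
    key = os.environ["CRUNCHBASE_API_KEY"]
    archive_path = download_export(key, Path("csv_export.tar.gz"))
    for member in extract_export(archive_path, Path("data")):
        print(member)
```

Keeping the key out of the script means the file can be copied between E: and Z: without exposing it.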

Current csv files from crunchbase data

data\acquisitions.csv
data\category_groups.csv
data\degrees.csv
data\events.csv
data\event_appearances.csv
data\funding_rounds.csv
data\funds.csv
data\investments.csv
data\investment_partners.csv
data\investors.csv
data\ipos.csv
data\jobs.csv
data\organizations.csv
data\organization_descriptions.csv
data\org_parents.csv
data\people.csv
data\people_descriptions.csv

The SQL script get_data.sql from last year has been copied to the current crunchbase3 directory. However, this year's data differs substantially from last year's, so the script needs adjustments. To keep track of the data types in each CSV file that is copied into the SQL tables, a file get_type.py is included in E:\projects\crunchbase3. This Python script prints the first 5 rows of every data frame (one per CSV file) in the current directory.
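
The kind of check get_type.py performs can be sketched as follows. This is not the actual script, which is not reproduced on this page and presumably uses data frames; it is a standard-library approximation, and the names guess_type and preview_csv as well as the coarse integer/float/text guess are assumptions for illustration.

```python
import csv
from pathlib import Path


def guess_type(values):
    """Guess a coarse type from the non-empty sample values of one column."""
    sample = [v for v in values if v != ""]
    if not sample:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in sample:
                cast(v)
            return name  # every sampled value parsed with this cast
        except ValueError:
            continue
    return "text"


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows data rows, guessed type per column)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    types = {
        col: guess_type([row[i] for row in rows if i < len(row)])
        for i, col in enumerate(header)
    }
    return header, rows, types


if __name__ == "__main__":
    # Preview every CSV file in the current directory.
    for csv_path in sorted(Path(".").glob("*.csv")):
        header, rows, types = preview_csv(csv_path)
        print(csv_path.name)
        for col in header:
            print(f"  {col}: {types[col]}")
        for row in rows:
            print("  ", row)
```

A guess based on five rows is only a starting point: a column that looks like an integer in the sample may still contain text or empty values further down, which is one way data type errors arise when copying into SQL tables.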

All crunchbase3 data from drive E is now also in Z:/crunchbase3

A version of the crunchbase3 database is live on the PostgreSQL server in Z:/crunchbase3. However, several CSV files have not yet been copied into the SQL database because of data type errors. This is a small problem, but fixing it will take some time. Hiep will work on it next week (March 28th).

A modified version of load_crunchbase.sql is now in both Z:/crunchbase3 and E:/projects/crunchbase3. The datasets, data types, and columns have changed substantially since the previous version. The columns not yet added to the PostgreSQL database are marked in the SQL script between two lines of ################'s. Because the data has changed so much since last year, running \i load_crunchbase.sql in one go was not very useful; it may be necessary to copy one table at a time by pasting the relevant part of the SQL script into the terminal.

Files that have not yet been copied to the PostgreSQL server are:

 degrees.csv
 events.csv
 funding_rounds.csv
 funds.csv
 investors.csv
 ipos.csv
 jobs.csv
 organizations.csv
 people.csv
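
Loading the remaining files one table at a time can be scripted by generating one psql \copy command per file. The sketch below assumes, hypothetically, that each table is named after its CSV file, that the table already exists with matching columns, and that each CSV begins with a header row; none of this is confirmed by load_crunchbase.sql, and the function name copy_command is made up for illustration.

```python
from pathlib import PurePath


def copy_command(csv_name, data_dir="data"):
    """Build a psql \\copy command for one CSV file.

    Assumption: the target table has the same name as the file stem,
    e.g. degrees.csv loads into a table called degrees.
    """
    table = PurePath(csv_name).stem
    path = f"{data_dir}/{csv_name}"
    return f"\\copy {table} FROM '{path}' WITH (FORMAT csv, HEADER true)"


if __name__ == "__main__":
    # Two of the pending files, as an example; paste the output into psql.
    for name in ["degrees.csv", "ipos.csv"]:
        print(copy_command(name))
```

Running one command per table makes it clear which file triggers a data type error, since a failed \copy stops at that table without affecting the others.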