Merging Existing Data with Crunchbase

From edegan.com
Jump to navigation Jump to search


McNair Project
Merging Existing Data with Crunchbase
Project logo 02.png
Project Information
Project Title Merging Existing Data with Crunchbase
Owner Connor Rothschild
Start Date
Deadline
Primary Billing
Notes
Has project status
Copyright © 2016 edegan.com. All Rights Reserved.


This page details the process of merging existing data with data pulled from Crunchbase.

Project Location

For the merge detailed in this page, our data was from:

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx

and Crunchbase info can be found in:

/bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx

The data from Crunchbase, organized into tables, is in a script found at:

/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql

Process

Step One: Creating UUID Matches

We began by making sure our company names were unique; creating a 1-1-1-1 relationship (only one instance of a company name in our data, and in Crunchbase data). We did so using the Matcher. We matched our sheet against itself, and Crunchbase info (pulled from organizations table detailed below) against itself, to remove duplicates and only leave unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information

Upon Ed's instruction, we then looked at companies in Crunchbase which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).

Upon refining our list based on recursive filtering, we found __ companies which match our data, and added UUIDs appropriately.

Step Two: Pulling Data

The necessary tables for this pull can be found at:

/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql

We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table organizations.

We pull based on the unique UUIDs found by Maxine, which can be found in the file:

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx, in column W.

The table we get most Crunchbase data from looks like this:

DROP TABLE organizations;
CREATE TABLE organizations (
 company_name varchar(100),
 role varchar(255),
 permalink varchar(255),
 domain varchar(5000),
 homepage_url varchar(5000),
 country_code varchar(10),
 state_code varchar(2),
 region varchar(50),
 city varchar(100),
 address text,
 status varchar(50),
 short_description text,
 category_list text,
 category_group_list  text,
 funding_rounds integer,
 funding_total_usd money,
 founded_on date, --yyyy-mm-dd
 last_funding_on date, --yyyy-mm-dd
 closed_on date, --yyyy-mm-dd
 employee_count varchar(255),
 email varchar(255),
 phone text,
 facebook_url varchar(5000),
 linkedin_url varchar(5000),
 cb_url varchar(5000),
 logo_url varchar(5000),
 twitter_url varchar(5000),
 alias varchar(10000),
 uuid varchar(255),
 created_at date, --yyyy-mm-dd-hh-mm-s.s
 updated_at date, --yyyy-mm-dd-hh-mm-s.s
 primary_role varchar(255),
 type varchar(255)
);

From this list, we care about the following:

  • company_name,
  • domain,
  • country_code,
  • state_code,
  • city,
  • address,
  • status,
  • short_description,
  • category_list,
  • category_group_list,
  • founded_on,
  • employee_count,
  • linkedin_url,
  • uuid -- our primary key

We also want to get more information on organization descriptions. To do so, we can pull description from the table organization_descriptions, matching based on UUID.

We also, for the purposes of industry classification, want to pull category_name from the table category_groups, matching based on UUID.

Finally, it may be worthwhile to pull variables such as name, description, and started_on from the events table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.

Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.

Step Three: Merging

Of the data we've pulled from Crunchbase, we're interested in merging four columns with our existing data:

  • domain (to be merged with the empty cells of courl)
  • city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)
  • status (to be merged with the empty cells of costatus)
  • short_description and description from the table organization_descriptions (some combination to be merged with empty cells of codescription)

Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for Maxine Tao's industry classifier.

The other columns can be added to the end of our sheet as supplemental data.