Difference between revisions of "Merging Existing Data with Crunchbase"

From edegan.com
Jump to navigation Jump to search
Line 167: Line 167:
 
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have URL at all. This is the case for many of the unfound 20 URLs from my test run above.
 
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have URL at all. This is the case for many of the unfound 20 URLs from my test run above.
  
------Actual Run Info--------
+
'''Actual Run Info'''
 
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.
 
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.
  

Revision as of 13:27, 27 July 2018


McNair Project
Merging Existing Data with Crunchbase
Project logo 02.png
Project Information
Project Title Merging Existing Data with Crunchbase
Owner Connor Rothschild
Start Date
Deadline
Primary Billing
Notes
Has project status
Copyright © 2016 edegan.com. All Rights Reserved.


This page details the process of merging existing data with data pulled from Crunchbase.

Project Location

For the merge detailed in this page, our data was from:

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx

and Crunchbase info can be found in:

/bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx

The data from Crunchbase, organized into tables, is in a script found at:

/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql

Code for building tables (from Maxine Tao):

E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql

And the database is

crunchbase2

Process

Step One: Creating UUID Matches

We began by making sure our company names were unique; creating a 1-1-1-1 relationship (only one instance of a company name in our data, and in Crunchbase data). We did so using the Matcher. We matched our sheet against itself, and Crunchbase info (pulled from organizations table detailed below) against itself, to remove duplicates and only leave unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information

Upon Ed's instruction, we then looked at companies in Crunchbase which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).

Upon refining our list based on recursive filtering, we found 40 companies which match our data, and added UUIDs appropriately.

Step Two: Pulling Data

The necessary tables for this pull can be found at:

/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql

We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table organizations.

We pull based on the unique UUIDs found by Maxine, which can be found in the file:

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx, in column W.

The table we get most Crunchbase data from looks like this:

DROP TABLE organizations;
CREATE TABLE organizations (
 company_name varchar(100),
 role varchar(255),
 permalink varchar(255),
 domain varchar(5000),
 homepage_url varchar(5000),
 country_code varchar(10),
 state_code varchar(2),
 region varchar(50),
 city varchar(100),
 address text,
 status varchar(50),
 short_description text,
 category_list text,
 category_group_list  text,
 funding_rounds integer,
 funding_total_usd money,
 founded_on date, --yyyy-mm-dd
 last_funding_on date, --yyyy-mm-dd
 closed_on date, --yyyy-mm-dd
 employee_count varchar(255),
 email varchar(255),
 phone text,
 facebook_url varchar(5000),
 linkedin_url varchar(5000),
 cb_url varchar(5000),
 logo_url varchar(5000),
 twitter_url varchar(5000),
 alias varchar(10000),
 uuid varchar(255),
 created_at date, --yyyy-mm-dd-hh-mm-s.s
 updated_at date, --yyyy-mm-dd-hh-mm-s.s
 primary_role varchar(255),
 type varchar(255)
);

From this list, we care about the following:

  • company_name,
  • domain,
  • country_code,
  • state_code,
  • city,
  • address,
  • status,
  • short_description,
  • category_list,
  • category_group_list,
  • founded_on,
  • employee_count,
  • linkedin_url,
  • uuid -- our primary key

We also want to get more information on organization descriptions. To do so, we can pull description from the table organization_descriptions, matching based on UUID.

We also, for the purposes of industry classification, want to pull category_name from the table category_groups, matching based on UUID.

Finally, it may be worthwhile to pull variables such as name, description, and started_on from the events table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.

Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.

Step Three: Merging

Of the data we've pulled from Crunchbase, we're interested in merging four columns with our existing data:

  • domain (to be merged with the empty cells of courl)
  • city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)
  • status (to be merged with the empty cells of costatus)
  • short_description and description from the table organization_descriptions (some combination to be merged with empty cells of codescription)

Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for Maxine Tao's industry classifier.

The other columns can be added to the end of our sheet as supplemental data.

SQL Scripts, Files, and Databases

The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx where copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2, Z:/../vcdb2).

The script AddCBData.sql loads this data into crunchbase2. It then outputs the relevant crunchbase data into CBCohortData.txt

The script LoadAcceleratorDataV2.sql (see around line 305) loads both CohortCosWcbuuid.txt and CBCohortData.txt into the database vcdb2. It then produces a CohortCoExtended table, which is output to a file.

Note that CohortCoExtended.txt includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:

 gotvc | count
-------+-------
     0 | 11465
     1 |  1504
(2 rows)

We now need to determine which cohort companies we have timing information for and which we don't - and use demo days to get the info we are missing!

Getting Timing info for Companies Who Got VC

Line 136 of

E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql

contains the code to find the companies which recieved VC but did not have timing info. There are 809 such companies. This table was exported into needtiminginfo.txt.

A list of distinct accelerators that we need timing data for was also created, which was given to Minh Le. There's 75 accelerators that need their timing doing.

Doing training data - 2,600 pages and are a little bit more than 1/2 way (~1500-1600).

Finding Company URLs

Excel master datasets are in:

E:\McNair\Projects\Accelerators\Summer 2018

Code and files specific to this URL finder are in:

E:\McNair\Projects\Accelerators\Summer 2018\url finder


RESULTS: I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.

In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):

E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx

We filter for companies (~4000) that did not receive VC, are not in crunchbase, and do not have URLs. Using a Google crawler(STEP1_crawl.py) and URL matching script(STEP2_findcorrecturl.py), we will try to find as many URLs as possible.

To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as a search term and taking the first 4 valid results (see don't collect list in code). The google crawler and URL matcher was able to correctly identify around 20 URLs. It also misidentifies some URLs that look really similar to the company name, but it is accurate for the most part if the name is not too generic. I then tried to run the 20 unfound company names through the crawler again, but this time I used company name + startup as the search term. This did not identify any more correct URLs.

It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have URL at all. This is the case for many of the unfound 20 URLs from my test run above.

Actual Run Info The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.

The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.

The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.

Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9.