{{Project
|Has project output=Data
|Has sponsor=McNair Center
|Has title=Merging Existing Data with Crunchbase
|Has owner=Connor Rothschild
}}
 
This page details the process of merging existing data with data pulled from Crunchbase.
The data from Crunchbase, organized into tables, is in a script found at:
/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql
 
Code for building tables (from [[Maxine Tao]]):
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
 
And the database is
crunchbase2
==Process==
===Step One: Creating UUID Matches===
We began by making sure our company names were unique, creating a 1-1 relationship (only one instance of each company name in our data and in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase data (pulled from the ''organizations'' table detailed below) against itself, to remove duplicates and leave only unique values. [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information More here]
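The de-duplication step can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual Matcher; the field name and normalization rule are assumptions:

```python
from collections import Counter

def unique_companies(rows, key="company_name"):
    """Keep only rows whose normalized company name occurs exactly once."""
    norm = lambda s: s.strip().lower()
    counts = Counter(norm(r[key]) for r in rows)
    return [r for r in rows if counts[norm(r[key])] == 1]

rows = [
    {"company_name": "Acme"},
    {"company_name": "acme "},   # duplicate of Acme after normalization
    {"company_name": "Globex"},
]
print(unique_companies(rows))  # [{'company_name': 'Globex'}]
```

Running this against both our sheet and the Crunchbase pull leaves only names that can match 1-1.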
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).
After refining our list with recursive filtering, we found __ 40 companies that matched our data, and added their UUIDs accordingly.
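The recursive-filtering idea, disambiguating duplicate-UUID companies by an additional variable such as location, can be illustrated like this. The sketch is hypothetical; the uuid/city field names are assumptions, not the script's actual columns:

```python
def disambiguate(our_company, cb_candidates):
    """Among Crunchbase rows sharing one company name, accept a UUID only
    if exactly one candidate also matches on location."""
    matches = [c for c in cb_candidates if c["city"] == our_company["city"]]
    return matches[0]["uuid"] if len(matches) == 1 else None

ours = {"name": "Acme", "city": "Houston"}
candidates = [
    {"uuid": "u1", "city": "Houston"},
    {"uuid": "u2", "city": "Boston"},
]
print(disambiguate(ours, candidates))  # u1
```

If no candidate (or more than one) matches on the extra variable, the company stays unmatched rather than being guessed.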
===Step Two: Pulling Data===
/bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''
We then pulled the relevant data from Crunchbase based on the unique UUID matches. These UUIDs, found by Maxine, are in column W of /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''. In the crunchbase2 database, most Crunchbase data comes from the table ''organizations'', which is built (in part) as follows:
 DROP TABLE organizations;
 CREATE TABLE organizations (
 company_name varchar(100),
 role varchar(255),
The columns of particular interest are:
*employee_count
*linkedin_url
*uuid -- our primary key
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.
Note: we may also be able to combine category_list, category_group_list, and (from the category_groups table) category_name, merge the result with cosector in our data, and use it for [[Maxine Tao]]'s industry classifier.
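The UUID-based pull of descriptions is just a join of ''organizations'' to ''organization_descriptions'' on uuid. A toy sqlite3 sketch (table and column names follow the page; the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE organizations (uuid TEXT PRIMARY KEY, company_name TEXT);
CREATE TABLE organization_descriptions (uuid TEXT, description TEXT);
INSERT INTO organizations VALUES ('u1', 'Acme');
INSERT INTO organization_descriptions VALUES ('u1', 'Makes anvils');
""")
rows = con.execute("""
SELECT o.company_name, d.description
FROM organizations o
JOIN organization_descriptions d ON o.uuid = d.uuid;
""").fetchall()
print(rows)  # [('Acme', 'Makes anvils')]
```

The same join pattern extends to any other UUID-keyed Crunchbase table.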
 
The other columns can be added to the end of our sheet as supplemental data.
 
==SQL Scripts, Files, and Databases==
 
The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx were copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2 and Z:/../vcdb2).
 
The script '''AddCBData.sql''' loads this data into '''crunchbase2'''. It then outputs the relevant Crunchbase data into '''CBCohortData.txt'''.
 
The script '''LoadAcceleratorDataV2.sql''' (see around line 305) loads both '''CohortCosWcbuuid.txt''' and '''CBCohortData.txt''' into the database '''vcdb2'''. It then produces a CohortCoExtended table, which is output to a file.
 
Note that '''CohortCoExtended.txt''' includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:
 
  gotvc | count
 -------+-------
      0 | 11465
      1 |  1504
 (2 rows)
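The GotVC flag amounts to marking whether a cohort company appears in the VC data and counting each group. A toy sqlite3 sketch of the pattern (table and column names are illustrative, not vcdb2's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cohortcos (company TEXT);
CREATE TABLE vcrounds (company TEXT);
INSERT INTO cohortcos VALUES ('Acme'), ('Globex'), ('Initech');
INSERT INTO vcrounds VALUES ('Acme');
""")
rows = con.execute("""
SELECT CASE WHEN v.company IS NULL THEN 0 ELSE 1 END AS gotvc,
       COUNT(*) AS count
FROM cohortcos c
LEFT JOIN vcrounds v ON c.company = v.company
GROUP BY gotvc
ORDER BY gotvc;
""").fetchall()
print(rows)  # [(0, 2), (1, 1)]
```

A LEFT JOIN keeps every cohort company, so companies with no VC round land in the gotvc = 0 group instead of disappearing.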
 
We now need to determine which cohort companies we have timing information for and which we don't - and use demo days to get the info we are missing!
 
==Getting Timing info for Companies Who Got VC==
 
Line 136 of
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
 
contains the code to find the companies which received VC but did not have timing info. There are 809 such companies. This table was exported into '''needtiminginfo.txt'''.
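Logically, this selection is a set difference: companies that got VC, minus companies we already have timing info for. A minimal sketch with made-up names:

```python
# Companies that received VC (per the GotVC flag), and the subset we
# already have cohort timing for; both sets are illustrative.
vc_companies = {"Acme", "Globex", "Initech"}
have_timing = {"Acme"}

need_timing = sorted(vc_companies - have_timing)
print(need_timing)  # ['Globex', 'Initech']
```

In the actual script this is done in SQL, but the logic is the same anti-join.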
 
A list of distinct accelerators that we need timing data for was also created and given to [[Minh Le]]. There are 75 accelerators that need timing data.
 
Training-data collection is ongoing: of the roughly 2,600 pages, a little more than half (~1,500-1,600) are done.
