{{Project|Has project output=Data|Has sponsor=McNair ProjectsCenter
|Has title=Merging Existing Data with Crunchbase
|Has owner=Connor Rothschild
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).
After refining our list through recursive filtering, we found 40 companies that match our data and added their UUIDs accordingly.
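The duplicate-UUID check and the location-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the sample rows, the <code>city</code> field, and the function names are all hypothetical, and it filters on a single variable in one pass rather than recursively over several.

```python
from collections import defaultdict

# Hypothetical sample rows: (company_name, uuid, city). Not real Crunchbase data.
crunchbase_rows = [
    ("Acme", "uuid-1", "Houston"),
    ("Acme", "uuid-2", "Boston"),
    ("Beta Labs", "uuid-3", "Austin"),
]

# Hypothetical rows from our own data: (company_name, city).
our_companies = [("Acme", "Boston"), ("Beta Labs", "Austin")]


def names_with_multiple_uuids(rows):
    """Map each company name that has more than one UUID to its candidates."""
    by_name = defaultdict(list)
    for name, uuid, city in rows:
        by_name[name].append((uuid, city))
    return {
        name: candidates
        for name, candidates in by_name.items()
        if len({uuid for uuid, _ in candidates}) > 1
    }


def resolve_by_location(ours, ambiguous):
    """Keep a match only when exactly one candidate shares the company's city."""
    resolved = {}
    for name, city in ours:
        hits = [uuid for uuid, cand_city in ambiguous.get(name, []) if cand_city == city]
        if len(hits) == 1:
            resolved[name] = hits[0]
    return resolved


ambiguous = names_with_multiple_uuids(crunchbase_rows)
matches = resolve_by_location(our_companies, ambiguous)
print(matches)
```

Names that are still ambiguous after the location check are left unmatched here; in practice they would be passed to the next filtering variable.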
===Step Two: Pulling Data===
The other columns can be added to the end of our sheet as supplemental data.
 
==SQL Scripts, Files, and Databases==
 
The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx were copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2, Z:/../vcdb2).
 
The script '''AddCBData.sql''' loads this data into '''crunchbase2'''. It then outputs the relevant Crunchbase data into '''CBCohortData.txt'''.
 
The script '''LoadAcceleratorDataV2.sql''' (see around line 305) loads both '''CohortCosWcbuuid.txt''' and '''CBCohortData.txt''' into the database '''vcdb2'''. It then produces a CohortCoExtended table, which is output to a file.
 
Note that '''CohortCoExtended.txt''' includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:
 
  gotvc | count
 -------+-------
      0 | 11465
      1 |  1504
 (2 rows)
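A tabulation like the one above can be reproduced with a simple GROUP BY. The sketch below is an illustration only: it uses Python's built-in <code>sqlite3</code> as a stand-in for the project database, and the table layout and the three sample rows are hypothetical (only the <code>gotvc</code> variable comes from the actual file).

```python
import sqlite3

# In-memory SQLite database standing in for vcdb2 (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified layout: the real CohortCoExtended has more columns.
conn.execute("CREATE TABLE CohortCoExtended (coname TEXT, gotvc INTEGER)")
conn.executemany(
    "INSERT INTO CohortCoExtended VALUES (?, ?)",
    [("A", 0), ("B", 1), ("C", 0)],  # made-up sample rows
)

# Count cohort companies by whether they received VC.
gotvc_counts = conn.execute(
    "SELECT gotvc, COUNT(*) FROM CohortCoExtended GROUP BY gotvc ORDER BY gotvc"
).fetchall()
print(gotvc_counts)
```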
 
We now need to determine which cohort companies we have timing information for and which we do not, and then use demo days to fill in the information we are missing.
 
==Getting Timing info for Companies Who Got VC==
 
Line 136 of
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
 
contains the code to find the companies that received VC but did not have timing info. There are 809 such companies. This table was exported into '''needtiminginfo.txt'''.
 
A list of the distinct accelerators for which we need timing data was also created and given to [[Minh Le]]. There are 75 accelerators that still need timing data.
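The two selections described in this section (companies that received VC but lack timing info, and the distinct accelerators they belong to) might look roughly like this. It is a sketch under stated assumptions, not the code in CompanyMatchScript.sql: the table name <code>cohortco</code>, the columns <code>accelerator</code> and <code>cohortdate</code>, and the sample rows are hypothetical, and <code>sqlite3</code> again stands in for the project database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: a NULL cohortdate means we have no timing info.
conn.execute(
    "CREATE TABLE cohortco (coname TEXT, accelerator TEXT, gotvc INTEGER, cohortdate TEXT)"
)
conn.executemany(
    "INSERT INTO cohortco VALUES (?, ?, ?, ?)",
    [
        ("Alpha", "Accel One", 1, "2015-06-01"),  # got VC, timing known
        ("Bravo", "Accel One", 1, None),          # got VC, timing missing
        ("Charlie", "Accel Two", 1, None),        # got VC, timing missing
        ("Delta", "Accel Two", 0, None),          # no VC, so not of interest here
        ("Echo", "Accel One", 1, None),           # got VC, timing missing
    ],
)

# Companies that received VC but have no timing information.
need_timing = conn.execute(
    "SELECT coname, accelerator FROM cohortco "
    "WHERE gotvc = 1 AND cohortdate IS NULL ORDER BY coname"
).fetchall()

# Distinct accelerators behind those companies (the list to hand off).
accelerators = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT accelerator FROM cohortco "
        "WHERE gotvc = 1 AND cohortdate IS NULL ORDER BY accelerator"
    )
]

print(need_timing)
print(accelerators)
```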
 
Training data is in progress: there are 2,600 pages, and we are a little more than halfway through (~1,500-1,600).
