Difference between revisions of "VC Database Rebuild"

From edegan.com
Jump to navigation Jump to search
Line 39: Line 39:
 
  \i LoadLongDescription.sql
 
  \i LoadLongDescription.sql
 
  \i LoadRound.sql
 
  \i LoadRound.sql
 +
 +
==Cleaning Process==
 +
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables. 
  
 
==Creating Base Tables==
 
==Creating Base Tables==

Revision as of 18:20, 17 July 2017

Plan

Rebuild roundbase, round, geo, ipos, mas from SDC data. Create companybase from roundbase Create round from roundbase. Build stageflags from round.

Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.

Build statelookup tables and roundlookup tables.

Clean firmbase tables. Clean ipo tables. Clean mas table.

Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.

Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.

Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.

Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.

Calculate exit date based on ipo, ma, datelastinv + 5 years.

Pull in sel flag into companybase and build dead or alive flag.

Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.

Clean fundbase and check valid key (fundname, statecode, firstinvdate)

Clean firmbase and check valid key (firmname, foundingdate)

Loading starting data into database

Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.

\i LoadFunds.sql
\i LoadIPOs.sql
\i LoadRoundbase.sql
\i LoadFirms.sql
\i LoadGeoData.sql
\i LoadLongDescription.sql
\i LoadRound.sql

Cleaning Process

The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.

Creating Base Tables

Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.

DROP TABLE companybase;
CREATE TABLE companybase AS
SELECT DISTINCT 
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip 
FROM roundbase
ORDER BY coname;
DROP TABLE round;
CREATE TABLE round AS
SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors
FROM roundbase
ORDER BY coname;

Creating Stage Flags Table

Stage flags will be used to later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if either the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in and thus should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc are the things that the exclude flag filters out. It is built off the round table.

DROP TABLE stageflags;
CREATE TABLE stageflags AS
SELECT coname, statecode, datefirstinv, rounddate, stage3,
CASE
 WHEN stage3 = 'Seed' THEN 1::int
 ELSE 0::int
END AS seedflag,
CASE
 WHEN stage3 = 'Early Stage' THEN 1::int
 ELSE 0::int
 END AS earlyflag,
CASE
 WHEN stage3 = 'Later Stage' THEN 1::int
 ELSE 0::int
 END AS laterflag,
CASE
 WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int
 ELSE 0::int
 END AS growthflag,
CASE
 WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int
 ELSE 0::int
 END AS transactionflag,
CASE
 WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' 
OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int
 ELSE 0::int
 END AS excludeflag
FROM round;

Cleaning the Companybase table

Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.

So first check to see if the key is valid on the base data using the following 2 queries.

SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;
--44774
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;
--44771

You can see that they key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So 1 key can match to more than one entry in the table. Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.

DROP TABLE companybase1;
CREATE TABLE companybase1 AS 
SELECT *,
CASE
 WHEN nationcode = 'US' THEN 1::int
 ELSE 0::int
 END AS alwaysusflag,
CASE
 WHEN coname = 'Undisclosed Company' THEN 1::int
 ELSE 0::int
 END AS undisclosedflag
FROM companybase;
SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;
--44771
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;
--44770

By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table.

SELECT *
FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key
GROUP BY coname, statecode, datefirstinv
HAVING COUNT(key) > 1;

The output looks like this:

          coname            | statecode | datefirstinv
----------------------------+-----------+--------------
New York Digital Health LLC | NY        | 2015-08-13

We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that. Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.

DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';

companybasecore table

The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on.

DROP TABLE companybasecore;
CREATE TABLE companybasecore AS
SELECT DISTINCT 
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip
FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';
--44740
--recheck keys
SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;
--44740
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;
--44740