Difference between revisions of "VC Database Rebuild"

From edegan.com
Jump to navigation Jump to search
Line 370: Line 370:
 
==Joining companybasecore with mascore==
 
==Joining companybasecore with mascore==
 
Before doing this stage make sure the following is true:
 
Before doing this stage make sure the following is true:
  *companybasecore and mascore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [Cleaning the Companybase table] [Cleaning mas table] for instructions
+
  *companybasecore and mascore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See *[Cleaning the Companybase table] *[Cleaning mas table] for instructions
  
 
==Creating Stage Flags Table==
 
==Creating Stage Flags Table==

Revision as of 15:26, 20 July 2017

Plan

Rebuild roundbase, round, geo, ipos, mas from SDC data. Create companybase from roundbase Create round from roundbase. Build stageflags from round.

Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.

Build statelookup tables and roundlookup tables.

Clean firmbase tables. Clean ipo tables. Clean mas table.

Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.

Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.

Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.

Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.

Calculate exit date based on ipo, ma, datelastinv + 5 years.

Pull in sel flag into companybase and build dead or alive flag.

Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.

Clean fundbase and check valid key (fundname, statecode, firstinvdate)

Clean firmbase and check valid key (firmname, foundingdate)

Loading starting data into database

Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.

\i LoadFunds.sql
\i LoadIPOs.sql
\i LoadRoundbase.sql
\i LoadFirms.sql
\i LoadGeoData.sql
\i LoadLongDescription.sql
\i LoadRound.sql

Cleaning Process

The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.

Creating Base Tables

Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.

DROP TABLE companybase;
CREATE TABLE companybase AS
SELECT DISTINCT 
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip 
FROM roundbase
ORDER BY coname;
DROP TABLE round;
CREATE TABLE round AS
SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors
FROM roundbase
ORDER BY coname;

Cleaning the Companybase table

Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.

So first check to see if the key is valid on the base data using the following 2 queries.

SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;
--44774
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;
--44771

You can see that they key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So 1 key can match to more than one entry in the table. Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.

DROP TABLE companybase1;
CREATE TABLE companybase1 AS 
SELECT *,
CASE
 WHEN nationcode = 'US' THEN 1::int
 ELSE 0::int
 END AS alwaysusflag,
CASE
 WHEN coname = 'Undisclosed Company' THEN 1::int
 ELSE 0::int
 END AS undisclosedflag
FROM companybase;
SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;
--44771
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;
--44770

By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table.

SELECT *
FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key
GROUP BY coname, statecode, datefirstinv
HAVING COUNT(key) > 1;

The output looks like this:

          coname            | statecode | datefirstinv
----------------------------+-----------+--------------
New York Digital Health LLC | NY        | 2015-08-13

We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that. Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.

DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';

companybasecore table

The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on.

DROP TABLE companybasecore;
CREATE TABLE companybasecore AS
SELECT DISTINCT 
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip
FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';
--44740
--recheck keys
SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;
--44740
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;
--44740

Cleaning ipos table

Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.

SELECT COUNT(*)
FROM (SELECT issuer, issuedate, statecode FROM ipos)a;
--10440
SELECT COUNT(*)
FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;
--9491

The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.

DROP TABLE ipoduplicates;
CREATE TABLE ipoduplicates AS
SELECT *, COUNT(*)
FROM (SELECT issuer, issuedate, statecode FROM ipos)a
GROUP BY issuer, issuedate, statecode
HAVING COUNT(*) > 1;
--939
\COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.

DROP TABLE ipoinclude;
CREATE TABLE ipoinclude AS
SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt
FROM ipos
GROUP BY issuer, issuedate, statecode;
--9470

Now use the ipoinclude table to create a ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.

DROP TABLE ipocore;
CREATE TABLE ipocore AS
SELECT ipos.issuer, ipos.issuedate, ipos.statecode
FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND 
ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;
SELECT COUNT(*)
FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;

You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.

DROP TABLE ipoduplicates2;
CREATE TABLE ipoduplicates2 AS
SELECT *, COUNT(*)
FROM (SELECT issuer, issuedate, statecode FROM ipocore)a
GROUP BY issuer, issuedate, statecode
HAVING COUNT(*) > 1;

Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.

--manually remove bad keys
DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';
--1
DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';
--1
DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';
--1
DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';
--1
DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';
--1
DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';
--1
DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';
--1
DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';
--1
DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';
--1
DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';
--1
DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';
--1
DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';
--1
DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';
--1
DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';
--1
DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';
--1
DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';
--1
DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';
--2
DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';
--2
DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  
--2

Now again JOIN your ipos table with your ipoinclude table and check the key count.

DROP TABLE ipocore;
CREATE TABLE ipocore AS
SELECT ipos.issuer, ipos.issuedate, ipos.statecode
FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND 
ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;
--9470
SELECT COUNT(*)
FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;
--9470

The counts line up so now you should have a clean ipocore table!

Cleaning mas table

Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate.

SELECT COUNT(*)
FROM mas;
--114890 
SELECT COUNT(*)
FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;
--114825

Great! The counts don't match so we'll have to clean the mas table. There is no obvious field to filter against with mas. So I inserted an id column in mas and took the MIN id for duplicate keys.

CREATE TABLE mas1 AS
SELECT *
FROM mas;
ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;
ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;
DROP TABLE masinclude;
CREATE TABLE masinclude AS
SELECT targetname, targetstatecode, announceddate, MIN(id) as id
FROM mas1
GROUP BY targetname, targetstatecode, announceddate;
--114825
DROP TABLE mascore;
CREATE TABLE mascore AS
SELECT mas.*
FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; 
--114825
SELECT COUNT(*)
FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;

The mas distinct key count match the total count of the table so therefore the mascore table is clean.

Name Based Matching companybase keys to mas keys

Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:

DROP TABLE maskeys;
CREATE TABLE maskeys AS
SELECT DISTINCT targetname, targetstatecode, announceddate
FROM mascore;
--114825
DROP TABLE maskeysmindates;
CREATE TABLE maskeysmindates AS
SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate
FROM mascore
GROUP BY targetname, targetstatecode;
--113236
DROP TABLE maskeysdatewindow;
CREATE TABLE maskeysdatewindow AS
SELECT maskeys.*, maskeysmindates.announceddate as minanndate,
CASE WHEN maskeys.announceddate - INTERVAL '7 day' > maskeysmindates.announceddate OR maskeys.announceddate = 
maskeysmindates.announceddate THEN 1::int
ELSE 0::int
END AS dateflag
FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND 
maskeys.targetstatecode = maskeysmindates.targetstatecode);
--114825

The dateflag is 1 when the current key's announceddate is 1 week older than the minimum announced date or it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is less than 1 week greater than the minimum announceddate for te targetname, targetstatecode pair, then it is 0.

CREATE TABLE maskeysdatefiltered AS
SELECT targetname, targetstatecode, announceddate
FROM maskeysdatewindow
WHERE dateflag = 1; 
--114794
\COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

Grab the portco keys from the companybasecore table:

DROP TABLE portcokeys;
CREATE TABLE portcokeys AS
SELECT DISTINCT coname, statecode, datefirstinv
FROM companybasecore;
--44740
\COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see The Matcher (Tool)

Fixing Errors in the Matcher Output for portco and mas

You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate < datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is "Hall-Warning:Multiple". Then import this back into your db by creating a matcheroutput table.

DROP TABLE matcherportcomas;
CREATE TABLE matcherportcomas (
 warning varchar(100),
 file1coname varchar(100),
 file1statecode varchar(2),
 file1datefirstinv date,
 file2targetname varchar(100),
 file2targetstatecode varchar (2),
 file2announceddate date,
 excludeflag1 int,
 excludeflag2 int,
 warningflag int
);
\COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--9645

You've imported 9,645 matches into your matcher table in vcdb2 but if you run the query below you will get the number of "good" matches. These are matches that do not contain warnings, where the datefirstinv > announceddate for a merger/acquisition and where the datefirstinv does not equal the announceddate.

SELECT COUNT(*) FROM 
(SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND 
excludeflag2 = 0 AND warningflag = 0)a;  
--8291

As you can see we're throwing out a lot of the data in the matcher file (9645 -> 8291). So the next few queries will try and save as much of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.

Select the portco keys that are matched to the minimum announceddate for any mergers:

DROP TABLE matcherwarningmindates;
CREATE TABLE matcherwarningmindates AS
SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) 
FROM matcherportcomas 
WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1
GROUP BY file1coname, file1statecode, file1datefirstinv;
--364

Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table.

DROP TABLE matcherportcomasinclude;
CREATE TABLE matcherportcomasinclude AS
SELECT m.* FROM
matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = 
mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  
WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;
--366

The inner join result should equal the amount in the matcherwarningmindates table but it doesn't. So to find the dirty entries we'll use the query below.

SELECT *, COUNT(*) FROM
(SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a
GROUP BY file1coname, file1statecode, file1datefirstinv
HAVING COUNT(*) > 1;
file1coname       | file1statecode | file1datefirstinv | count
-------------------------+----------------+-------------------+-------
 PA Inc                  | TX             | 2007-09-25        |     2
 High Sierra Energy L.P. | CO             | 2004-12-23        |     2

Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:

DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' 
AND file2targetstatecode = 'VA';
--1
DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND 
file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';
--1

Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.

SELECT COUNT(*) FROM
(SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;
--364
SELECT COUNT(*) FROM matcherportcomasinclude;
--364

Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.

CREATE TABLE matcherportcomascore AS
SELECT *
FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0
UNION ALL
SELECT *
FROM matcherportcomasinclude;
--8655

Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.

SELECT COUNT(*) FROM (
SELECT DISTINCT file1coname, file1statecode, file1datefirstinv
FROM matcherportcomascore) AS foo;
--8655

Great! Now you are ready to begin joining the companybase table to the mas table.

Joining companybasecore with mascore

Before doing this stage make sure the following is true:

*companybasecore and mascore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See *[Cleaning the Companybase table] *[Cleaning mas table] for instructions

Creating Stage Flags Table

Stage flags will be used to later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if either the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in and thus should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc are the things that the exclude flag filters out. It is built off the round table.

DROP TABLE stageflags;
CREATE TABLE stageflags AS
SELECT coname, statecode, datefirstinv, rounddate, stage3,
CASE
 WHEN stage3 = 'Seed' THEN 1::int
 ELSE 0::int
END AS seedflag,
CASE
 WHEN stage3 = 'Early Stage' THEN 1::int
 ELSE 0::int
 END AS earlyflag,
CASE
 WHEN stage3 = 'Later Stage' THEN 1::int
 ELSE 0::int
 END AS laterflag,
CASE
 WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int
 ELSE 0::int
 END AS growthflag,
CASE
 WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int
 ELSE 0::int
 END AS transactionflag,
CASE
 WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' 
OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int
 ELSE 0::int
 END AS excludeflag
FROM round;