Changes

Jump to navigation Jump to search
6,887 bytes added ,  17:31, 30 January 2018
SDC files are located here and the normalized versions are copied into the Z folder above:
E:\McNair\Projects\VC Database
Database can be started by typing psql vcdb2
The file containing all the SQL queries used to build vcdb2 is located in the Z drive and named ProcessData2.sql.
Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql
==Plan==
==Loading starting data into database==
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.
\i LoadFunds.sql
\i LoadIPOs.sql
\i LoadMAs.sql
\i LoadRoundbase.sql
\i LoadFirms.sql
--42296
\COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
'''NOTES FROM ED:'''
*There wasn't a full population deadalive portco level keyed table, so I made one
 
DROP TABLE PortCoDeadAliveMaster;
CREATE TABLE PortCoDeadAliveMaster as
SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,
extract(year from deaddate) AS deadyear
FROM deadalive;
--44740
 
This code is in '''FixingVCDB2.sql''' in Z:\VentureCapitalData\SDCVCData\vcdb2
 
==Creating coleveloutput==
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:
END AS excludeflag
FROM round;
 
'''NOTES FROM ED'''
*There wasn't a colevel, portco keyed master flag table, so I made one from '''roundplus''' (see below)
*I also included basic round summary data
 
DROP TABLE PortCoFlagMaster;
CREATE TABLE PortCoFlagMaster AS
SELECT coname, statecode, datefirstinv, city
max(growthflag) AS hadgrowth, sum(growthflag) as numgrowth,
max(seedflag) AS hadseed, sum(seedflag) as numseed,
max(earlyflag) AS hadearly, sum(earlyflag) as numearly,
max(laterflag) AS hadlater, sum(laterflag) as numlater,
max(transactionflag) AS hadtrans, sum(transactionflag) as numtrans,
max(excludeflag) AS hadexcl, sum(excludeflag) as numexcl,
max(dealflag) AS haddeal,
COUNT(rounddate) as numrounds, sum(numinvestors) as totalinvestors
FROM roundplus
GROUP BY coname, statecode, datefirstinv, city;
--44721
 
Code is in FixingVCDB2.sql in Z:\VentureCapitalData\SDCVCData\vcdb2
 
==Fixing erroneous geo-coordinates==
Some of the geocoordinates in the db are dirty and point to locations in India, Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and do a lookup using Geocode.py script. To isolate place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import back into db.
--41718
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.
 
==Re-Fixing erroneous geo-coordinates==
There are still geo errors in the db. Addresses within the US have incorrect geo-coordinates. To fix this problem we will just lookup all the addresses in the DB using the Geocode.py script. Also we need to pull a company level file from SDC because the addresses will be copied down or be null by the normalizer. Modify your round ssh sdc script to remove the round dates. Therefore only one line will be assigned to one company. There will be no normalization errors this way. Then copy into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Then run this through the geocode script. Copy the result back into the db and redo the colevel output tables.
 
'''NOTES FROM ED:
*The lat/long data has to be stored as decimal or numeric to prevent precision problems.
*The join was to the old (wrong) base table, that had addresses copied down.
*The code below has been fixed up and rerun. The fixed up code is in ProcessingCoLevelSimple.sql
 
DROP TABLE geoallcoords;
CREATE TABLE geoallcoords (
statecode varchar(2),
datefirstinv date,
latitude realnumeric, longitude realnumeric
);
\COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--44999
--select latitude, longitude from geoallcoords;
--redo the colevel output tables
DROP TABLE coleveloutput;
CREATE TABLE coleveloutput AS
SELECT companybasecore2A.coname, companybasecore2A.statecode, companybasecore2A.datefirstinv, companybasecore2A.city, companybasecore2A.addr1, companybasecore2A.addr2, companybasecore2A.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear FROM companybasecore2 sdccompanybasecore2 AS A LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2A.coname AND d.statecode=companybasecore2A.statecode AND d.datefirstinv=companybasecore2A.datefirstinv LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2A.coname AND g.statecode = companybasecore2A.statecode AND g.datefirstinv = companybasecore2 A.datefirstinv
WHERE hadgrowthvc=1;
--3257532380
DROP TABLE colevelsimple;
CREATE TABLE colevelsimple AS
SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude
FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;
--31523
<nowiki>\COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV</nowiki> '''NOTE FROM ED:'''*The query below is now redundant. *The same cleaning is done in ProcessingCoLevelSimple.sql*The clean table is CoLevelSimpleClean*This clean table is used as the basis for CoLevelBlowOut I also added flags on the geodata table to filter points outside the US. You can use the geoallcoords1 table instead of geoallcoords and set excludeflag = 1 to filter out 292 erroneous points when you create your colevel tables. CREATE TABLE geoallcoords1 AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude = NULL OR longitude = NULL THEN 1::int ELSE 0::int END AS excludeflag FROM geoallcoords; --44999  SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1; --292
==Cleaning firmbase==
 
NOTE FROM MEGHANA:
 
*The base table - firmbase - was reloaded after it was found to be incorrectly normalized.
*[[Retrieving_US_VC_Data_From_SDC#VC_Funds]] was corrected
*The code below has been rerun!
 
First flag the undisclosed funds.
 
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that is valid and make your core table.
 
DROP TABLE firmbase1;
CREATE TABLE firmbase1 AS
SELECT *, CASE
ELSE 0::int END AS undisclosedflag
FROM firmbase;
--145679734
SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;
--141459452
SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;
--141459452  DROP TABLE firmbasecore;
CREATE TABLE firmbasecore AS
SELECT * FROM firmbase1 WHERE undisclosedflag = 0;
--141459452
Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by selecting the lesser foundingdate.
GROUP BY firmname
HAVING COUNT(*) > 1;
--126
DROP TABLE firmbaseinclude;
INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname
GROUP BY f.firmname;
--126
DROP TABLE firmbasecorefirmbasec; CREATE TABLE firmbasecore firmbasec AS SELECT l.*
FROM firmbase1 AS l
LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate
WHERE r.firmname IS NULL AND undisclosedflag = 0;
--141339446  DROP TABLE firmbasecore; CREATE TABLE firmbasecore AS SELECT * FROM firmbasec WHERE firmname NOT LIKE 'Amundi%';
SELECT COUNT(DISTINCT firmname) FROM firmbasecore;
--141339445 NOTE: We have dropped 'Amundi' for reasons specified in "Cleaning branchoffices," below. ==Cleaning branchoffices==  DROP TABLE branchofficesbase; CREATE TABLE branchofficesbase AS SELECT *, CASE WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int ELSE 0::int END AS undisclosedflag FROM branchoffices; --9734  SELECT COUNT(*) FROM branchofficesbase WHERE undisclosedflag = 0; --9452  DROP TABLE branchofficecore; CREATE TABLE branchofficecore AS SELECT * FROM branchofficesbase WHERE undisclosedflag = 0; --9452  DROP TABLE bocore; CREATE TABLE bocore AS SELECT * FROM branchofficecore WHERE firmname not like 'Amundi%'; --9450  SELECT COUNT(*) FROM (SELECT DISTINCT firmname FROM bocore WHERE bocity IS NOT NULL) AS T; --1962 There are no duplicates in firmbasecore. In firmbase, 'Amundi Private Equity Funds SA' is the only duplicate, but we dropped the one with the earlier founding date when we cleaned firmbase. Amundi has 2 branch offices in the same city (Bucharest) but different zip codes, which is why there are different counts when we use firmname/firmname,bocity vs firmname, bozip as our key. NOTE: We have dropped "Amundi" from firmbasecore, as well as branchofficecore, in order to use firmname as a valid key for bocore.
==Cleaning fundbase==
 
 
NOTE FROM ED:
*The base table - '''fundbase''' - was reloaded after it was found to be incorrectly normalized.
**All sorts of fields were copied down when they shouldn't have been.
*[[Retrieving_US_VC_Data_From_SDC#VC_Funds]] was corrected
*The code in this section has been rerun.
 
First flag the undisclosed funds.
DROP TABLE fundbase1;
CREATE TABLE fundbase1 AS
SELECT *, CASE
FROM fundbase;
--27588
 
SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;
--27097
 
SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;
--27097
 
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.
 
DROP TABLE fundbasecore;
CREATE TABLE fundbasecore AS
SELECT *
SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;
--27050
 
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.
 
DROP TABLE fundnameexclude;
CREATE TABLE fundnameexclude AS
FROM fundbase1 AS f
INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;
--4417
--create fundcore table
UNION ALL
SELECT * FROM fundinclude2;
--2704727020
==Name based matching firms to funds==
SELECT fundname, firstinvdate, firmname
FROM fundbasecore;
--2709727020
\COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
FROM firmbasecore;
\COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--141459446
CREATE TABLE matcherfirmsfunds (
\COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--2364
 
==Joining firms with funds==
DROP TABLE firmfundstestjoin;
==Joining funds with roundline==
There is a lot of mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to fix this issue. SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0; --19677After fixing and reimporting the roundline into vcdb2 there are still 1,963 funds that appear in roundline that are not in fundbasecore.
SELECT countCOUNT(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundnameWHERE b.fundname IS NULL)) AS tas r; --170891963
--Look at the ones that don't match
roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE
b.fundname IS NULL;
SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;
--19091
 
SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM
roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;
--17128
SELECT COUNT(DISTINCT fundname) FROM fundbasecore;
==Creating portcoexitmaster==
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.Note that the amounts are actually in m (not k) in the input.  ALTER TABLE portcoexitbuild RENAME TO portcoexitbuildbak;  
CREATE TABLE portcoexitbuild AS
SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtkipoamt, mastransactionamtkmaamt, investedk
FROM companybaseipomasmaster;
--44740
DROP TABLE portcoexitmaster;
CREATE TABLE portcoexitmaster AS
SELECT coname, statecode, datefirstinv, investedk, CASE WHEN ipoissuedate IS NOT NULL THEN ipoissuedate WHEN masannounceddate IS NOT NULL THEN masannounceddate ELSE NULL END AS exitdate,
CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,
CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,
CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,
CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtkipoamt::numeric::float8/1000 ELSE mastransactionamtkmaamt::numeric::float8/1000 END AS
exitvaluem
FROM portcoexitbuild;
\COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
==Creating portcoroundlinefirmmaster==
portcoroundlinefirmmaster table contains portcokey, roundline, firm table.
CREATE TABLE portcoroundlinefirmmaster AS
SELECT c.coname, c.statecode, c.datefirstinv, r.rounddate, r.amountk, r.fundname, f.firmname
FROM companybasecore AS c
INNER JOIN roundline AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv
INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname
INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname;
--299321
==Joining funds, firms, roundline with companybasecore==
--22374
\COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
===Creating a PortCoMaster Table===
 
Using '''sdccompanybasecore2''' as a base table, Ed created a useful PortCoMasterTable. This table isn't perfect:
*sdccompanybasecore2 was drawn later and has 44966. It has fixed addresses (without copy down mistakes)
*The other tables - PortCoDeadAliveMaster, PortCoExitMaster, and PortCoFlagMaster - are built off slightly earlier draws and so have 44740 records
-The other tables are all keyed by coname, statecode and datefirstinv.
 
DROP TABLE PortCoMaster;
CREATE TABLE PortCoMaster AS
SELECT A.*,
B.deaddate, B.aliveyear, B.deadyear,
C.ipoflag, C.maflag, C.exitflag, C.exitvaluem,
D.hadgrowth, D.numgrowth, D.hadseed, D.numseed, D.hadearly, D.numearly, D.hadlater, D.numlater, D.hadtrans,
D.numtrans, D.hadexcl, D.numexcl, D.haddeal, D.numrounds, D.totalinvestors
FROM sdccompanybasecore2 AS A
LEFT JOIN PortCoDeadAliveMaster AS B ON A.coname=B.coname AND A.statecode=B.statecode AND
A.datefirstinv=B.datefirstinv
LEFT JOIN PortCoExitMaster AS C ON A.coname=C.coname AND A.statecode=C.statecode AND A.datefirstinv=C.datefirstinv
LEFT JOIN PortCoFlagMaster AS D ON A.coname=D.coname AND A.statecode=D.statecode AND A.datefirstinv=D.datefirstinv;
--44966
 
\COPY PortCoMaster TO 'PortCoMaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--44966
==Cleaning roundline==
Add a flag for undisclosed funds.
DROP TABLE roundline1;
CREATE TABLE roundline1 AS
SELECT *, CASE

Navigation menu