Revision as of 15:08, 7 August 2018
VentureXpert Data

Project Information
Project Title: VentureXpert Data
Owner: Augi Liebster
Start Date: June 20, 2018
Deadline:
Primary Billing:
Notes:
Has project status: Active
Contents
- 1 Relevant Former Projects
- 2 Location
- 3 Goal
- 4 Initial Stages
- 5 SDC Pull
- 6 Loading Tables
- 7 Cleaning Companybase, Fundbase, Firmbase, and BranchOffice
- 8 Instructions on Matching PortCos to Issuers and M&As From Ed
- 9 Cleaning IPO and MA Data
- 10 Process For Creating the PortCoExits Table
- 11 Creating ExitKeysClean
- 12 Create the PortCoExit And PortCoAliveDead Tables
- 13 GeoCoding Companies, Firms, and Branch Offices
Relevant Former Projects
Location
My scripts for SDC pulls are located in the E drive in the location:
E:\VentureXpertDatabase\ScriptsForSDCExtract
My successfully pulled and normalized files are stored in the location:
E:\VentureXpertDatabase\ExtractedDataQ2
My script for loading data is in one big text file in the location:
E:\VentureXpertDatabase\vcdb3\LoadingScripts
There are a variety of SQL files in there with self-explanatory names. The file that contains all of the loading scripts is called LoadingScriptsV1. The folder vcdb2 is there for reference, to see what people before had done. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.
Goal
I will redesign the VentureXpert Database so that it is built more intuitively than the previous version. I will also update the database with current data.
Initial Stages
The first step of the project was to figure out what primary keys to use for each major table that I create. I looked at the primary keys used in the creation of the VC Database Rebuild and found that they were a decent starting point. I have updated them and list them below:
- CompanyBaseCore - coname, statecode, datefirstinv
- IPOCore - issuer, issuedate, statecode
- MACore - targetname, targetstate, announceddate
- Geo - city, statecode, coname, datefirst, year
- DeadDate - coname, statecode, datefirst, rounddate (tentative, could still change)
- RoundCore - coname, statecode, datefirst, rounddate
- FirmBaseCore - firmname
- FundBaseCore - fundname (firstinvdate doesn't work because not every row has an entry)
These are my initial listings and I will come back to update them if needed.
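The validity check used throughout this page, comparing the table's row count against the count of distinct key tuples, can be sketched in miniature. The table and rows below are invented for illustration; the real checks are run in psql against the loaded tables.

```python
import sqlite3

# A candidate primary key is valid when the count of distinct key tuples
# equals the total row count of the table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companybase (coname TEXT, statecode TEXT, datefirstinv TEXT)")
con.executemany(
    "INSERT INTO companybase VALUES (?, ?, ?)",
    [
        ("Acme Inc", "CA", "1999-01-01"),
        ("Acme Inc", "NY", "1999-01-01"),   # same name, different state: still distinct
        ("Beta LLC", "TX", "2001-05-05"),
        ("Beta LLC", "TX", "2001-05-05"),   # true duplicate: breaks the key
    ],
)
total = con.execute("SELECT COUNT(*) FROM companybase").fetchone()[0]
distinct = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)"
).fetchone()[0]
print(total, distinct)  # 4 3 -> the key is not valid until the duplicate is handled
```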
The second part of the initial stage has been to pull data from the SDC Platinum platform. I did it in July to ensure that I had two full quarters of data.
SDC Pull
When pulling data from SDC, it is a good idea to look for previously made rpt files that have the names of the pulls you will need to do. They have already been created and will save you a lot of work. The rpt files that I used are in the folder VentureXpertDB/ScriptsForSDCExtract. The files come in pairs, with one saved as an ssh file and one as an rpt file. To bring the dates up to date, go into the ssh file of the pair and change the date of last investment. When you open SDC, you will be given a variety of choices for which database to pull from. For each type of file choose the following:
- VentureXpert - PortCo, PortCoLong, USVC, Firms, BranchOffices, Funds, Rounds, VCFirmLong
- Mergers & Acquisitions - MAs
- Global New Issues Databases - IPOs
Help on pulling data from SDC is on the SDC Platinum (Wiki) page.
VCFund Pull Problem
When pulling the VCFund1980-Present, I encountered two problems. First, SDC is not able to sort out the funds that are US-only with the built-in filters. Second, there are multiple rpt files that specify different variables for the fund pull. I pulled from both to be safe, but on the VC Database Rebuild page there is a section on the fund pull where Ed specifies which rpt file he used to pull data from SDC. Regardless, I have both saved in the ExtractedData folder. After speaking with Ed, he told me to use the VCFund1980-present.rpt file to extract the data. I had various problems extracting the data, including the SDC program freezing and an Out of Memory error. Check the SDC Platinum (Wiki) page to fix these issues.
Loading Tables
When I describe errors I encountered, I will not refer to line numbers, because as soon as any data is added the line numbers become useless. Instead, I recommend that you copy the normalized file you are working with into an Excel file and use the filter feature. This way you can find the line in your specific file that is causing errors and fix it in the file itself. The line numbers that PuTTY errors display are often wrong, so I relied on Excel to find errors fastest. If my instructions are not enough for you to find the error, my advice would be to take key words from the line that PuTTY reports as causing the error and filter for them in Excel.
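As an alternative to the Excel filter trick, a short script can flag suspect rows directly. This sketch (the file contents are made up) reports rows whose tab-separated field count differs from the header's, which is how stray quotation marks and similar glitches usually surface in these normalized files:

```python
def find_bad_rows(lines):
    """Yield (row_number, field_count) for rows that don't match the header width."""
    header_width = len(lines[0].rstrip("\n").split("\t"))
    for i, line in enumerate(lines[1:], start=2):
        width = len(line.rstrip("\n").split("\t"))
        if width != header_width:
            yield i, width

# Invented sample: a 3-column file where one row lost a field.
sample = [
    "coname\tcity\tstatecode\n",
    "Acme Inc\tHouston\tTX\n",
    "Bad \"Row\tAustin\n",          # only 2 fields: something went wrong here
]
print(list(find_bad_rows(sample)))  # [(3, 2)]
```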
DROP TABLE roundbase;
CREATE TABLE roundbase (
    coname varchar(255),
    rounddate date,
    updateddate date,
    foundingdate date,
    datelastinv date,
    datefirstinv date,
    investedk real,
    city varchar(100),
    description varchar(5000),
    msa varchar(100),
    msacode varchar(10),
    nationcode varchar(10),
    statecode varchar(10),
    addr1 varchar(100),
    addr2 varchar(100),
    indclass varchar(100),
    indsubgroup3 varchar(100),
    indminor varchar(100),
    url varchar(5000),
    zip varchar(10),
    stage1 varchar(100),
    stage3 varchar(100),
    rndamtdisck real,
    rndamtestk real,
    roundnum integer,
    numinvestors integer
);
\COPY roundbase FROM 'USVC1980-2018q2-Good.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --151549
The only error I encountered here was with Cardtronic Technology Inc. Here there was a problem with a mixture of quotation marks which cause errors in loading. Find this using the excel trick and remove it manually.
DROP TABLE ipos;
CREATE TABLE ipos (
    issuedate date,
    issuer varchar(255),
    statecode varchar(10),
    principalamt money, --million
    proceedsamt money, --sum of all markets in million
    naiccode varchar(255), --primary NAIC code
    zipcode varchar(10),
    status varchar(20),
    foundeddate date
);
\COPY ipos FROM 'IPO1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --12107
I encountered no errors while loading this data.
DROP TABLE branchoffices; CREATE TABLE branchoffices ( firmname varchar(255), bocity varchar(100), bostate varchar(2), bocountrycode varchar(2), bonation varchar(100), bozip varchar(10), boaddr1 varchar(100), boaddr2 varchar(100) );
\COPY branchoffices FROM 'USVCFirmBranchOffices1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --10353
I encountered no errors while loading this data.
DROP TABLE roundline; CREATE TABLE roundline ( coname varchar(255), statecode varchar(2), datelastinv date, datefirstinv date, rounddate date, disclosedamt money, investor varchar(255) );
\COPY roundline FROM 'USVCRound1980-2018q2-NoFoot-normal-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --403189
I encountered no errors while loading this data.
DROP TABLE fundbase;
CREATE TABLE fundbase (
    fundname varchar(255),
    closedate date, --mm-dd-yyyy
    lastinvdate date, --mm-dd-yyyy
    firstinvdate date, --mm-dd-yyyy
    numportcos integer,
    investedk real,
    city varchar(100),
    fundyear varchar(4), --yyyy
    zip varchar(10),
    statecode varchar(2),
    fundsizem real,
    fundstage varchar(100),
    firmname varchar(255),
    dateinfoupdate date,
    invtype varchar(100),
    msacode varchar(10),
    nationcode varchar(10),
    raisestatus varchar(100),
    seqnum integer,
    targetsizefund real,
    fundtype varchar(100)
);
\COPY fundbase FROM 'VCFund1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --29397
There is a Ukrainian fund that has stray quotation marks in its name. It is called something along the lines of "VAT "ZNVKIF "Skhidno-Evropeis'lyi investytsiynyi Fond". If this does not help, you can filter in Excel using Kiev as the keyword in the city column and find the line where you are getting errors. Then manually remove the stray quotation marks in the actual text file. After that, the table should load correctly.
DROP TABLE firmbase;
CREATE TABLE firmbase (
    firmname varchar(255),
    foundingdate date, --mm-dd-yyyy
    datefirstinv date, --mm-dd-yyyy
    datelastinv date, --mm-dd-yyyy
    addr1 varchar(100),
    addr2 varchar(100),
    location varchar(100),
    city varchar(100),
    zip varchar(10),
    areacode integer,
    county varchar(100),
    state varchar(2),
    nationcode varchar(10),
    nation varchar(100),
    worldregion varchar(100),
    numportcos integer,
    numrounds integer,
    investedk money,
    capitalundermgmt money,
    invstatus varchar(100),
    msacode varchar(10),
    rolepref varchar(100),
    geogpref varchar(100),
    indpref varchar(100),
    stagepref varchar(100),
    type varchar(100)
);
\COPY firmbase FROM 'USVCFirms1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --15899
The normalization for this file was wrong when I tried to load the data. To fix this, go to the file where you have removed the footer and find the column header titled Firm Capital under Mgmt{0Mil}. Delete the {0Mil} and renormalize the file. Then everything should be ok. A good way to check this is to copy and paste the normalized file into an Excel sheet and see whether the entries line up with their column headers correctly. The second error I found was with the Kerala Ventures firm. Here the address contains the word l"opera. This quotation mark will cause errors, so find the line number using Excel and remove it manually. The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually. These were the only errors I encountered while loading this table.
DROP TABLE mas;
CREATE TABLE mas (
    announceddate date,
    effectivedate date,
    targetname varchar(255),
    targetstate varchar(100),
    acquirorname varchar(255),
    acquirorstate varchar(100),
    transactionamt money,
    enterpriseval varchar(255),
    acquirorstatus varchar(150)
);
\COPY mas FROM 'MAUSTargetComp100pc1985-July2018-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --119432
I encountered no problems loading in this data.
DROP TABLE longdescription;
CREATE TABLE longdescription (
    coname varchar(255),
    statecode varchar(10),
    fundingdate date, --date co received first inv
    codescription varchar(10000) --long description
);
\COPY longdescription FROM 'PortCoLongDesc-Ready-normal-fixed.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --48037
I encountered no problems loading this data.
Cleaning Companybase, Fundbase, Firmbase, and BranchOffice
Cleaning Company
The primary key for port cos will be coname, datefirstinv, and statecode. Before checking whether this is a valid primary key, remove the undisclosed companies. I will explain the second part of the query concerning New York Digital Health later.
DROP TABLE companybasecore; CREATE TABLE companybasecore AS SELECT * FROM Companybase WHERE nationcode = 'US' AND coname != 'Undisclosed Company' AND NOT (coname='New York Digital Health LLC' AND statecode='NY' AND datefirstinv='2015-08-13' AND updateddate='2015-10-20'); --48001
SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore) AS T; --48001
Since the count of the table and the count of the distinct primary key are equivalent, you know that the primary key is valid. In the initial cleaning of the table, I first filtered out only the undisclosed companies. This table had 48002 rows. I then ran the DISTINCT query above and found that there are 48001 distinct rows with the coname, datefirstinv, statecode primary key. Thus there must be two rows that share a primary key. I found this key using the following query:
SELECT coname, datefirstinv, statecode
FROM companybase
GROUP BY coname, datefirstinv, statecode
HAVING COUNT(*) > 1;
The company named 'New York Digital Health LLC' came up as the company that is causing the problems. I queried to find the two rows that list this company name in companybase and chose to keep the row that had the earlier updated date. It is a good practice to avoid deleting rows from tables when possible, so I added the filter as a WHERE clause to exclude one of the New York Digital listings.
Cleaning Fundbase
The primary key for funds will be only the fundname. First get rid of all of the undisclosed funds.
DROP TABLE fundbasenound; CREATE TABLE fundbasenound AS SELECT DISTINCT * FROM fundbase WHERE fundname NOT LIKE '%Undisclosed Fund%'; --28886
SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbasenound)a; --28833
As you can see, fundbase still has rows that share fundnames. If you are wondering why the DISTINCT in the first query did not eliminate these, it is because this DISTINCT applies to the whole row not individual fundnames. Thus, only completely duplicate rows will be eliminated in the first query. I chose to keep the funds that have the earlier last investment date.
DROP TABLE fundups; CREATE TABLE fundups AS SELECT fundname, max(lastinvdate) AS lastinvdate FROM fundbasenound GROUP BY fundname HAVING COUNT(*)>1; --53
DROP TABLE fundbasecore; CREATE TABLE fundbasecore AS SELECT A.* FROM fundbasenound AS A LEFT JOIN fundups AS B ON A.fundname=B.fundname AND A.lastinvdate=B.lastinvdate WHERE B.fundname IS NULL AND B.lastinvdate IS NULL; --28833
Since the count of fundbasecore is the same as the number of distinct fund names, we know that the fundbasecore table is clean. In the first query I am finding duplicate rows and choosing the row that has the greater last investment date. I then match this table back to fundbasenound but choose all the rows from fundbasecore for which there is no corresponding fund in fundups based on fund name and date of last investment. This allows the funds with the earlier date of last investment to be chosen.
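The keep-the-earlier-row pattern above can be seen on a toy example (fund names and dates are invented): fundups collects the later lastinvdate for each duplicated fundname, and the anti-join (WHERE B.fundname IS NULL) drops exactly those later rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fundbasenound (fundname TEXT, lastinvdate TEXT)")
con.executemany("INSERT INTO fundbasenound VALUES (?, ?)", [
    ("Alpha Fund I", "2010-01-01"),
    ("Alpha Fund I", "2015-06-30"),  # later duplicate -> should be dropped
    ("Beta Fund II", "2012-03-15"),
])
# LEFT JOIN against the duplicated (fundname, max date) pairs, then keep only
# the rows with no match: the earlier duplicate rows and all singleton rows.
rows = con.execute("""
    SELECT A.fundname, A.lastinvdate
    FROM fundbasenound AS A
    LEFT JOIN (SELECT fundname, MAX(lastinvdate) AS lastinvdate
               FROM fundbasenound GROUP BY fundname HAVING COUNT(*) > 1) AS B
      ON A.fundname = B.fundname AND A.lastinvdate = B.lastinvdate
    WHERE B.fundname IS NULL
    ORDER BY A.fundname
""").fetchall()
print(rows)  # [('Alpha Fund I', '2010-01-01'), ('Beta Fund II', '2012-03-15')]
```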
Cleaning Firmbase
The primary key for firms will be firm name. First I got rid of all undisclosed firms. I also filtered out two firms that have identical firm names and founding dates. The reason for this is because I use founding dates to filter out duplicate firm names. If there are two rows that have the same firm name and founding date, they will not be filtered out by the third query below. Thus, I chose to filter those out completely.
DROP TABLE firmbasenound; CREATE TABLE firmbasenound AS SELECT DISTINCT * FROM firmbase WHERE firmname NOT LIKE '%Undisclosed Firm%' AND firmname NOT LIKE '%Amundi%' AND firmname NOT LIKE '%Schroder Adveq Management%'; --15452
SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM firmbasenound)a; --15437
Since these counts are not equal we will have to clean the table further. We will use the same method from before.
DROP TABLE firmdups; CREATE TABLE firmdups AS SELECT firmname, max(foundingdate) as foundingdate FROM firmbasenound GROUP BY firmname HAVING COUNT(*)>1; --15
DROP TABLE firmbasecore; CREATE TABLE firmbasecore AS SELECT A.* FROM firmbasenound AS A LEFT JOIN firmdups AS B ON A.firmname=B.firmname AND A.foundingdate=B.foundingdate WHERE B.firmname IS NULL AND B.foundingdate IS NULL; --15437
Since the count of firmbasecore and the DISTINCT query are the same, the firm table is now clean.
Cleaning Branch Offices
When cleaning the branch offices, I had to remove all duplicates in the table. This is because the table is so sparse that often the only data in a row would be the firm name the branch was associated with. Thus, I couldn't filter based on dates as I had been doing previously for firms and funds. The primary key is firm name.
DROP TABLE bonound; CREATE TABLE bonound AS SELECT *, CASE WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int ELSE 0::int END AS undisclosedflag FROM branchoffices; --10353
SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM bonound)a; --10042
Since these counts aren't the same, we will have to work a little more to clean the table. As stated above, I did this by excluding the firm names that were duplicated.
DROP TABLE branchofficecore; CREATE TABLE branchofficecore AS SELECT A.* FROM bonound AS A JOIN ( SELECT bonound.firmname, COUNT(*) FROM bonound GROUP BY firmname HAVING COUNT(*) =1 ) AS B ON A.firmname=B.firmname WHERE undisclosedflag=0; --10032
SELECT COUNT(*) FROM (SELECT DISTINCT firmname FROM branchofficecore)a; --10032
Since these counts are the same, we are good to go. The count is 10 lower because we completely removed 10 firmnames from the listing by throwing out the duplicates.
Instructions on Matching PortCos to Issuers and M&As From Ed
Company Standardizing
Get portco keys
DROP TABLE portcokeys;
CREATE TABLE portcokeys AS
SELECT coname, statecode, datefirst
FROM portcocore; --CHECK COUNT IS SAME AS portcocore OR THESE KEYS ARE NOT VALID AND FIX THAT FIRST
Get distinct coname and put it in a file
\COPY (SELECT DISTINCT coname FROM portcokeys) TO 'DistinctConame.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV
Match that to itself
Move DistinctConame.txt to E:\McNair\Software\Scripts\Matcher\Input
Open powershell and change directory to E:\McNair\Software\Scripts\Matcher
Run the matcher in mode 2: perl Matcher.pl -file1="DistinctConame.txt" -file2="DistinctConame.txt" -mode=2
Pick up the output file from E:\McNair\Software\Scripts\Matcher\Output (it is probably called DistinctConame.txt-DistinctConame.txt.matched) and move it to your Z drive directory
Load the matches into the dbase
DROP TABLE PortcoStd;
CREATE TABLE PortcoStd (
    conamestd varchar(255),
    coname varchar(255),
    norm varchar(100),
    x1 varchar(255),
    x2 varchar(255)
);
\COPY PortcoStd FROM 'DistinctConame.txt-DistinctConame.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --YOUR COUNT
Join the Conamestd back to the portcokeys table to create your matching table
DROP TABLE portcokeysstd;
CREATE TABLE portcokeysstd AS
SELECT B.conamestd, A.*
FROM portcokeys AS A
JOIN PortcoStd AS B ON A.coname=B.coname; --CHECK COUNT IS SAME AS portcokeys OR YOU LOST SOME NAMES OR INFLATED THE DATA
Put that in a file for matching (conamestd is in first column by construction)
\COPY portcokeysstd TO 'PortCoMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --YOUR COUNT
MA Cleaning and Matching
First remove all of the duplicates in the MA data. Do this by running aggregate queries on every column except for the primary key:
DROP TABLE MANoDups;
CREATE TABLE MANoDups AS
SELECT targetname, targetstate, announceddate,
    MIN(effectivedate) AS effectivedate,
    MIN(acquirorname) AS acquirorname,
    MIN(acquirorstate) AS acquirorstate,
    MAX(transactionamt) AS transactionamt,
    MAX(enterpriseval) AS enterpriseval,
    MIN(acquirorstatus) AS acquirorstatus
FROM mas
GROUP BY targetname, targetstate, announceddate
ORDER BY targetname, targetstate, announceddate; --119374
SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM manodups)a; --119374
Since these counts are equivalent, the data set is clean. Then get all the primary keys from the table and copy the distinct target names into a text file.
DROP TABLE makey; CREATE TABLE makey AS SELECT targetname, targetstate, announceddate FROM manodups; --119374
\COPY (SELECT DISTINCT targetname FROM makey) TO 'DistinctTargetName.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV; --117212
After running this list of distinct target names through the matcher, put the standardized MA list into the data base.
DROP TABLE MaStd; CREATE TABLE MaStd ( targetnamestd varchar(255), targetname varchar(255), norm varchar(100), x1 varchar(255), x2 varchar(255) );
\COPY mastd FROM 'DistinctTargetName.txt-DistinctTargetName.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --117212
Then match the list of standardized names back to the makey table to get a table with standardized keys and primary keys. This will be your input for matching against port cos.
DROP TABLE makeysstd; CREATE TABLE makeysstd AS SELECT B.targetnamestd, A.* FROM makey AS A JOIN mastd AS B ON A.targetname=B.targetname; --119374
\COPY makeysstd TO 'MAMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --119374
Use this text file to match against the PortCoMatchInput. Your job will be to determine whether the matches between the MAs and PortCos are true matches. The techniques that I used are described in the section below.
IPO Cleaning and Matching
The process is the same for IPOs.
DROP TABLE iponodups;
CREATE TABLE iponodups AS
SELECT issuer, statecode, issuedate,
    MAX(principalamt) AS principalamt,
    MAX(proceedsamt) AS proceedsamt,
    MIN(naiccode) AS naicode,
    MIN(zipcode) AS zipcode,
    MIN(status) AS status,
    MIN(foundeddate) AS foundeddate
FROM ipos
GROUP BY issuer, statecode, issuedate
ORDER BY issuer, statecode, issuedate; --11149
SELECT COUNT(*) FROM(SELECT DISTINCT issuer, statecode, issuedate FROM iponodups)a; --11149
DROP TABLE ipokeys; CREATE TABLE ipokeys AS SELECT issuer, statecode, issuedate FROM iponodups; --11149
\COPY (SELECT DISTINCT issuer FROM ipokeys) TO 'IPODistinctIssuer.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --10803
DROP TABLE ipokeysstd; CREATE TABLE ipokeysstd ( issuerstd varchar(255), issuer varchar(255), norm varchar(100), x1 varchar(255), x2 varchar(255) ); \COPY ipokeysstd FROM 'IPODistinctIssuer.txt-IPODistinctIssuer.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --10803
DROP TABLE ipostd; CREATE TABLE ipostd AS SELECT B.issuerstd, A.* FROM ipokeys AS A JOIN ipokeysstd AS B ON A.issuer=B.issuer; --11149
\COPY ipostd TO 'IPOMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --11149
As with MA, match this file against the PortCoMatchInput file without mode 2. Then manually check the matches using the techniques described below.
I generally use MAX for amounts and MIN for dates. I also chose to use MIN on text strings.
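The MAX-for-amounts, MIN-for-dates-and-text convention collapses duplicate keys like this (the rows below are invented for illustration). SQL aggregates skip NULLs, which is what makes this a safe way to merge partially filled duplicates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE mas (targetname TEXT, announceddate TEXT, transactionamt REAL, acquirorname TEXT)"
)
# Two rows with the same key but complementary detail columns.
con.executemany("INSERT INTO mas VALUES (?, ?, ?, ?)", [
    ("Acme Inc", "2005-02-01", 10.0, "BigCo"),
    ("Acme Inc", "2005-02-01", None, "BigCo Holdings"),
])
rows = con.execute("""
    SELECT targetname, announceddate,
           MAX(transactionamt),   -- NULL is skipped, the disclosed amount survives
           MIN(acquirorname)      -- an arbitrary but deterministic pick for text
    FROM mas
    GROUP BY targetname, announceddate
""").fetchall()
print(rows)  # [('Acme Inc', '2005-02-01', 10.0, 'BigCo')]
```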
Cleaning IPO and MA Data
It is important to follow Ed's direction of cleaning the data using aggregate functions before putting the data into Excel. This will spare you a lot of unnecessary manual checking. When ready, paste the data you have into an Excel file. In that Excel file, I made three columns: one checking whether the state codes were equivalent, one checking whether the date of first investment was 3 years before the MA or IPO, and one checking whether both of these conditions were satisfied for each company. I did this using simple IF statements. This process is manual checking and filtering to see whether matches are correct or not, and it is thus extremely subjective and tedious. First, I went through and checked the companies that did not have equivalent state codes. If the company was one that I knew, or the name was unique to the point that I did not believe the same name would appear in another state, I marked the state codes as equivalent. I did the same for the date of first investment vs the MA/IPO date. Then I removed all duplicates that had the marking Warning Multiple Matches, and the data sheets were clean.
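The three check columns can also be expressed outside Excel. The sketch below is my reading of the rules (in particular, the exact form of the three-year condition is my interpretation, so treat the helper as illustrative rather than the page's definitive check):

```python
from datetime import date

def check_match(statecode_a, statecode_b, datefirstinv, exitdate):
    """Illustrative version of the three check columns: states equal,
    first investment at least 3 years before the exit, and both together."""
    states_ok = statecode_a == statecode_b
    try:
        three_years_on = datefirstinv.replace(year=datefirstinv.year + 3)
    except ValueError:  # Feb 29 starting date
        three_years_on = datefirstinv.replace(month=2, day=28, year=datefirstinv.year + 3)
    window_ok = exitdate >= three_years_on
    return states_ok, window_ok, states_ok and window_ok

print(check_match("TX", "TX", date(2000, 5, 1), date(2004, 6, 1)))  # (True, True, True)
print(check_match("TX", "CA", date(2000, 5, 1), date(2001, 1, 1)))  # (False, False, False)
```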
Process For Creating the PortCoExits Table
MA Process
First we must load the clean, manually checked tables back into the database.
DROP TABLE MAClean; CREATE TABLE MAClean ( conamestd varchar(255), targetnamestd varchar(255), method varchar(100), x1 varchar(255), coname varchar(255), statecode varchar(10), datefirstinv date, x2 varchar(255), targetname varchar(255), targetstate varchar(10), announceddate date );
\COPY MAClean FROM 'MAClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --7205
SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM MAClean)a; --7188
As you can see there are still duplicate primary keys in the table. To get rid of these I wrote a query that chooses primary keys that occur only once and matches them against MANoDups. That way you will have unique primary keys by construction.
DROP TABLE MACleanNoDups;
CREATE TABLE MACleanNoDups AS
SELECT A.*, effectivedate, transactionamt, enterpriseval, acquirorstatus
FROM MAClean AS A
JOIN (
    SELECT targetname, targetstate, announceddate, COUNT(*)
    FROM MAClean
    GROUP BY targetname, targetstate, announceddate
    HAVING COUNT(*)=1
) AS B
    ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
LEFT JOIN MANoDups AS C
    ON A.targetname=C.targetname AND A.targetstate=C.targetstate AND A.announceddate=C.announceddate; --7171
SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM MACleanNoDups)a; --7171
Thus the portco primary key is unique in the table. We will use this later. Now do the same for the IPOs.
IPO Process
DROP TABLE IPOClean; CREATE TABLE IPOClean ( conamestd varchar(255), issuernamestd varchar(255), method varchar(100), x1 varchar(255), coname varchar(255), statecode varchar(10), datefirstinv date, x2 varchar(255), issuername varchar(255), issuerstate varchar(10), issuedate date ); \COPY IPOClean FROM 'IPOClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --2146
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM IPOClean)a; --2141
As with the MA process, there were duplicates in the clean IPO table. Get rid of these using the same process as with MAs. Only choose the primary keys that occur once and join these to the IPONoDups table.
DROP TABLE IPOCleanNoDups;
CREATE TABLE IPOCleanNoDups AS
SELECT A.*, principalamt, proceedsamt, naicode AS naics, zipcode, status, foundeddate
FROM IPOClean AS A
JOIN (
    SELECT issuername, issuerstate, issuedate, COUNT(*)
    FROM IPOClean
    GROUP BY issuername, issuerstate, issuedate
    HAVING COUNT(*)=1
) AS B
    ON A.issuername=B.issuername AND A.issuerstate=B.issuerstate AND A.issuedate=B.issuedate
LEFT JOIN IPONoDups AS C
    ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=C.issuedate; --2136
SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM IPOCleanNoDups)a; --2136
Now the duplicates are out of the MAClean and IPOClean data and we can start to construct the ExitKeysClean table.
Creating ExitKeysClean
First I looked for the PortCos that were in both the MAs and the IPOs. I did this using:
DROP TABLE IPOMAForReview;
CREATE TABLE IPOMAForReview AS
SELECT A.*, B.targetname, B.targetstate, B.announceddate
FROM IPOCleanNoDups AS A
JOIN MACleanNoDups AS B
    ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv; --92
I then pulled out the IPOs that were only IPOs and the MAs that were only MAs. I also added a column indicating whether a company underwent an IPO or an MA.
DROP TABLE IPONoConflict; CREATE TABLE IPONoConflict AS SELECT A.*, 1::int as IPOvsMA FROM IPOCleanNoDups AS A LEFT JOIN MACleanNoDups AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.statecode IS NULL AND B.coname IS NULL AND B.datefirstinv IS NULL; --2044
DROP TABLE MANoConflict; CREATE TABLE MANoConflict AS SELECT A.*, 0::int as IPOvsMA FROM MACleanNoDups AS A LEFT JOIN IPOCleanNoDups AS B ON A.coname=B.Coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.statecode IS NULL AND B.coname IS NULL AND B.datefirstinv IS NULL; --7079
Since 2136-92=2044 and 7171-92=7079, we know that the duplicate companies were extracted successfully.
I then wrote a query to check whether the IPO issue date or the announced date of the MA was earlier, and used that to decide whether the company was treated as having undergone an MA or an IPO in the column IPOvsMA. A 0 in the column represented an MA being chosen and a 1 represented an IPO being chosen.
Then out of this table I extracted the MAs and IPOs using the created IPOvsMA flag:
DROP TABLE MASelected; CREATE TABLE MASelected AS SELECT coname, statecode, datefirstinv, targetname, targetstate, announceddate, 0::int as IPOvsMA FROM IPOMAForReview WHERE issuedate >= announceddate; --25
DROP TABLE IPOSelected; CREATE TABLE IPOSelected AS SELECT coname, statecode, datefirstinv, issuername, issuerstate, issuedate, 1::int as IPOvsMA FROM IPOMAForReview WHERE issuedate < announceddate; --67
I then made the ExitKeysClean table using the portco primary key and the IPOvsMA indicator column.
DROP TABLE ExitKeysClean;
CREATE TABLE ExitKeysClean AS
SELECT coname, statecode, datefirstinv, ipovsma FROM IPONoConflict
UNION SELECT coname, statecode, datefirstinv, ipovsma FROM IPOSelected
UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MANoConflict
UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MASelected; --9215
Create the PortCoExit And PortCoAliveDead Tables
From consulting with Ed and the VC Database Rebuild wiki, I decided to make the PortCoExit table with an mavsipo, an exitdate, an exited, and an exitvalue column. I use the MAvsIPO column to add in data. It is very important that you have constructed this column.
DROP TABLE PortCoExit;
CREATE TABLE PortCoExit AS
SELECT A.coname, A.statecode, A.datefirstinv,
    CASE WHEN a.mavsipo=0 THEN 0::int WHEN a.mavsipo=1 THEN 1::int ELSE NULL::int END AS mavsipo,
    CASE WHEN a.mavsipo=0 THEN B.announceddate WHEN a.mavsipo=1 THEN C.issuedate ELSE NULL::date END AS exitdate,
    CASE WHEN a.mavsipo=0 OR a.mavsipo=1 THEN 1::int ELSE 0::int END AS exited,
    CASE WHEN a.mavsipo=0 THEN B.transactionamt WHEN a.mavsipo=1 THEN C.proceedsamt ELSE NULL::money END AS exitvalue
FROM ExitKeysClean AS A
LEFT JOIN MANoDups AS B
    ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
LEFT JOIN IPONoDups AS C
    ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=C.issuedate; --48001
I then used this table to build one that has information as to whether a company was dead or alive. I found this information by checking whether a company had undergone an IPO or MA, indicating the company was dead. Alternatively, if the company's date of last investment was more than 5 years ago, I marked the company as dead.
DROP TABLE PortCoAliveDead;
CREATE TABLE PortCoAliveDead AS
SELECT *,
    datefirstinv AS alivedate,
    extract(year from datefirstinv) AS aliveyear,
    CASE WHEN exitdate IS NOT NULL THEN exitdate
         WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN (datelastinv + INTERVAL '5 year')
         ELSE NULL::date END AS deaddate,
    CASE WHEN exitdate IS NOT NULL THEN extract(year from exitdate)
         WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN extract(year from (datelastinv + INTERVAL '5 year'))
         ELSE NULL::int END AS deadyear
FROM PortCoExit; --48001
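The deaddate CASE logic reduces to a small function. This is a sketch, assuming the July 1, 2018 cutoff used in the query; the example dates are invented:

```python
from datetime import date

ASOF = date(2018, 7, 1)  # the '7/1/2018' cutoff used on this page

def dead_date(exitdate, datelastinv):
    """Exited companies die on their exit date; otherwise a company is dead
    5 years after its last investment, provided that date has already passed."""
    if exitdate is not None:
        return exitdate
    five_years_on = datelastinv.replace(year=datelastinv.year + 5)
    return five_years_on if five_years_on < ASOF else None

print(dead_date(date(2010, 3, 1), date(2009, 1, 1)))  # 2010-03-01 (exited)
print(dead_date(None, date(2011, 4, 15)))             # 2016-04-15 (5-year rule)
print(dead_date(None, date(2016, 1, 1)))              # None (still alive)
```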
GeoCoding Companies, Firms, and Branch Offices
A helpful page here is the Geocode.py page which explains how to use the Geocoding script. You will have to tweak the Geocode script when geocoding as each of these tables has a different primary key. It is vital that you include the primary keys in the file you input and output from the Geocoding script. Without these, you will not be able to join the latitudes and longitudes back to the firm, branch office, or company base tables.
Geocoding costs money since we are using the Google Maps API. The process doesn't cost much, but in order to save money I tried to salvage as much of the preexisting geocode information as I could find.
Companies
I found the table of old companies with latitudes and longitudes in vcdb2 and loaded these into vcdb3.
DROP TABLE oldgeocords;
CREATE TABLE oldgeocords (
    coname varchar(255),
    statecode varchar(10),
    datefirstinv date,
    investedk real,
    city varchar(255),
    addr1 varchar(255),
    addr2 varchar(100),
    latitude numeric,
    longitude numeric
);
\COPY oldgeocords FROM 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV --44740
The API occasionally gives erroneous latitude and longitude readings. In order to catch only the good ones, I found the latitude and longitude bounds that encompass the mainland US and created an exclude flag marking companies that fall outside this box. I then created flags to include companies in Puerto Rico, Hawaii, and Alaska. Companies in these places often had wrong latitude and longitude readings of 44.93, 7.54, so I ran a query making sure that these weren't listed.
DROP TABLE geoallcoords; CREATE TABLE geoallcoords AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM oldgeocords; --44740
DROP TABLE geoallcoords1; CREATE TABLE geoallcoords1 AS SELECT *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM geoallcoords; --44740
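The two filtering steps above, the mainland bounding box and the (44.9331, 7.54012) sentinel check for PR/HI/AK, can be expressed compactly in Python. This is an illustrative translation of the SQL flags, not project code:

```python
# Rough bounding box for the contiguous US, as used in the SQL above
LAT_MIN, LAT_MAX = 24, 50
LON_MIN, LON_MAX = -125, -66
BAD_LAT, BAD_LON = 44.9331, 7.54012  # erroneous reading the API sometimes returns

def exclude_flag(lat, lon):
    # 1 when the point is missing or falls outside the mainland box
    if lat is None or lon is None:
        return 1
    return 0 if (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX) else 1

def territory_flag(statecode, lat, lon, code):
    # 1 for a PR/HI/AK record whose coordinates are not the bad sentinel
    return 1 if statecode == code and lat != BAD_LAT and lon != BAD_LON else 0
```

A record is kept downstream when exclude_flag is 0 or any territory flag is 1, matching the WHERE clause used to build goodgeoold.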
I then included only companies that were either in the mainland US, Hawaii, Alaska, or Puerto Rico.
DROP TABLE goodgeoold; CREATE TABLE goodgeoold AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords1 AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --38498
I then found the remaining companies that needed to be geocoded. Only companies with a listed address can be accurately geocoded; if we geocode on city alone, the location returned is simply the center of the city. Thus, I chose the companies we did not already have coordinates for and that had a valid address.
DROP TABLE remaininggeo; CREATE TABLE remaininggeo AS SELECT A.coname, A.statecode, A.datefirstinv, A.addr1, A.addr2, A.city, A.zip FROM companybasecore AS A LEFT JOIN goodgeoold AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.coname IS NULL AND A.addr1 IS NOT NULL; --5955
\COPY remaininggeo TO 'RemainingGeo.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --5955
I copied this table into Excel to concatenate the address, city, state, and zipcode columns into a single address column. This can and should be done in SQL (for example with concat_ws), but I was not aware of that at the time. I then ran remaininggeo through the Geocode script with columns coname, statecode, datefirstinv, and address in the input file.
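The concatenation step can be sketched as a small helper that skips NULL or empty components, the same effect as Postgres's concat_ws; the function name is illustrative:

```python
def full_address(*parts):
    """Join address components into one geocodable string, skipping
    NULL/empty fields -- equivalent to concat_ws(', ', ...) in Postgres."""
    return ", ".join(p.strip() for p in parts if p and p.strip())
```

Fields like addr2 are often NULL, so unconditional concatenation would leave stray separators in the geocoder input; skipping empties avoids that.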
DROP TABLE remaining; CREATE TABLE remaining ( coname varchar(255), statecode varchar(10), datefirstinv date, latitude numeric, longitude numeric );
\COPY remaining FROM 'RemainingLatLong.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --5955
I then ran the same geographical checks on the newly geocoded companies and found all of the good geocodes.
DROP TABLE geoallcoords2; CREATE TABLE geoallcoords2 AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM remaining; --5955
DROP TABLE geoallcoords3; CREATE TABLE geoallcoords3 AS SELECT *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM geoallcoords2; --5955
DROP TABLE goodgeonew; CREATE TABLE goodgeonew AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords3 AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --5913
I then combined the old and new geocodes and matched them back to the company base table to get a geo table for companies.
DROP TABLE geocodesportco; CREATE TABLE geocodesportco AS SELECT * FROM goodgeonew UNION SELECT * FROM goodgeoold; --44411
DROP TABLE portcogeo; CREATE TABLE portcogeo AS SELECT A.*, B.latitude, B.longitude FROM companybasecore AS A LEFT JOIN Geocodesportco AS B ON A.coname=B.coname AND A.datefirstinv=B.datefirstinv AND A.statecode=B.statecode; --48001
Firms
This process is largely the same as for companies. I found old firms that had already been geocoded and checked for accuracy.
DROP TABLE oldfirmcoords; CREATE TABLE oldfirmcoords ( firmname varchar(255), latitude numeric, longitude numeric ); \COPY oldfirmcoords FROM 'FirmCoords.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --5556
DROP TABLE firmoldfilter; CREATE TABLE firmoldfilter AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM oldfirmcoords; --5556
Since oldfirmcoords does not have state codes, we need to bring state codes in to flag firms based in Puerto Rico, Hawaii, and Alaska. I did this by matching the firmoldfilter table back to the firm base table.
DROP TABLE firmcoordsmatch1; CREATE TABLE firmcoordsmatch1 AS SELECT A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmbasecore AS A LEFT JOIN firmoldfilter AS B ON A.firmname=B.firmname; --15437
The process of tagging the PR, HI, and AK firms and keeping only correctly tagged rows is then the same as for companies.
DROP TABLE firmcoordsexternal; CREATE TABLE firmcoordsexternal AS SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM firmcoordsmatch1; --15437
DROP TABLE goodfirmgeoold; CREATE TABLE goodfirmgeoold AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternal AS B ON A.firmname=B.firmname WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --5346
Find the remaining firms and run the Geocode script on them.
DROP TABLE remainingfirm; CREATE TABLE remainingfirm AS SELECT A.firmname, A.addr1, A.addr2, A.city, A.state, A.zip FROM firmcoreonedupremoved AS A LEFT JOIN goodfirmgeoold AS B ON A.firmname=B.firmname WHERE B.firmname IS NULL AND A.addr1 IS NOT NULL AND A.msacode!='9999'; --706
\COPY remainingfirm TO 'FirmGeoRemaining.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --706
DROP TABLE firmremainingcoords; CREATE TABLE firmremainingcoords( firmname varchar(255), latitude numeric, longitude numeric );
\COPY firmremainingcoords FROM 'FirmRemainingCoords.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --706
Follow the same filtering process as above to get the good geocodes.
DROP TABLE firmnewfilter; CREATE TABLE firmnewfilter AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM firmremainingcoords; --706
DROP TABLE firmcoordsmatch2; CREATE TABLE firmcoordsmatch2 AS SELECT A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmcoreonedupremoved AS A LEFT JOIN firmnewfilter AS B ON A.firmname=B.firmname; --15437
DROP TABLE firmcoordsexternalremaining; CREATE TABLE firmcoordsexternalremaining AS SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM firmcoordsmatch2; --15437
DROP TABLE goodfirmgeonew; CREATE TABLE goodfirmgeonew AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternalremaining AS B ON A.firmname=B.firmname WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --703
Combine the old and new geocoded firms and match them to firm base to get a firm geo table.
DROP TABLE firmgeocoords; CREATE TABLE firmgeocoords AS SELECT * FROM goodfirmgeonew UNION SELECT * FROM goodfirmgeoold; --6049
DROP TABLE firmgeocore; CREATE TABLE firmgeocore AS SELECT A.*, B.latitude, B.longitude FROM firmbasecore AS A LEFT JOIN firmgeocoords AS B ON A.firmname=B.firmname; --15437
Branch Offices
I did not use old branch office data because I could not find it anywhere in the old data set. I have since found old data in the table firmbasecoords in vcdb2.
First copy all of the needed data out of the database to do geocoding.
\COPY (SELECT A.firmname, A.boaddr1, A.boaddr2, A.bocity, A.bostate, A.bozip FROM bonound AS A WHERE A.boaddr1 IS NOT NULL) TO 'BranchOffices.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --2046
Then load the data into the database and follow the same filtering process as above.
DROP TABLE bogeo; CREATE TABLE bogeo ( firmname varchar(255), latitude numeric, longitude numeric );
\COPY bogeo FROM 'BranchOfficesGeo.txt' WITH DELIMITER AS E'\t' NULL AS '' CSV HEADER --2046
DROP TABLE bogeo1; CREATE TABLE bogeo1 AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM bogeo; --2046
DROP TABLE bomatchgeo; CREATE TABLE bomatchgeo AS SELECT A.*, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM branchofficecore AS A LEFT JOIN bogeo1 AS B ON A.firmname=B.firmname; --10032
DROP TABLE bogeo2; CREATE TABLE bogeo2 AS SELECT *, CASE WHEN bostate='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN bostate='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN bostate='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM bomatchgeo; --10032
Match the correctly geocoded branch offices back to firm base to get the final table.
DROP TABLE bogeocore1; CREATE TABLE bogeocore1 AS SELECT * FROM bogeo2 WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --1161
DROP TABLE firmbogeo; CREATE TABLE firmbogeo AS SELECT A.*, B.latitude AS BOLatitude, B.longitude AS BOLongitude FROM firmgeocore AS A LEFT JOIN bogeocore1 AS B ON A.firmname=B.firmname; --15437