Difference between revisions of "VentureXpert Data"

From edegan.com
Jump to navigation Jump to search
Line 218: Line 218:
 
The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually.
 
The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually.
 
These were the only errors I encountered while loading this table.
 
These were the only errors I encountered while loading this table.
 +
 +
 +
DROP TABLE mas;
 +
CREATE TABLE mas (
 +
  announceddate date,
 +
  effectivedate date,
 +
  targetname varchar(255),
 +
  targetstate varchar(100),
 +
  acquirorname varchar(255),
 +
  acquirorstate varchar(100),
 +
  transactionamt money,
 +
  enterpriseval varchar(255),
 +
  acquirorstatus varchar(150)
 +
 +
);
 +
 +
\COPY mas FROM 'MAUSTargetComp100pc1985-July2018-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
--119432
 +
  
 
==Instructions on Matching PortCos to Issuers and M&As From Ed==
 
==Instructions on Matching PortCos to Issuers and M&As From Ed==
Line 368: Line 387:
 
  FROM MAClean
 
  FROM MAClean
 
  GROUP BY targetname, targetstate, announceddate;
 
  GROUP BY targetname, targetstate, announceddate;
 +
--7189
 +
There are two companies that have the name Masspower in the MACleanNoDups file. One is written in all caps and will thus not be caught by the aggregate function. You will have to find it manually and delete it in order to ensure that your joining of MAs to companybasecore will not add in extra rows.
 +
DELETE FROM MaCleanNoDups WHERE coname='Masspower' AND announceddate='2006-03-15';
 +
--7188
 +
 +
Now do the same for the IPOs.
  
 
  DROP TABLE IPOCleanNoDups;
 
  DROP TABLE IPOCleanNoDups;
Line 377: Line 402:
 
  --2141
 
  --2141
  
Now the duplicates are out of the MAClean and IPOClean data and we can start to construct the PortCoExit table.
+
Now the duplicates are out of the MAClean and IPOClean data and we can start to construct the ExitKeysClean table.
+
 
 +
==Creating ExitKeysClean===
  
 
First I looked for the PortCos that were in both the MAs and the IPOs. I did this using:
 
First I looked for the PortCos that were in both the MAs and the IPOs. I did this using:
Line 386: Line 412:
 
  FROM IPOClean AS A
 
  FROM IPOClean AS A
 
  JOIN MAClean AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
 
  JOIN MAClean AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
 +
--92
 +
 +
I then pulled out the IPOs that were only IPOs and MAs that were only MAs.
 +
 +
DROP TABLE ExitKeysCleanIPONoExit;
 +
CREATE TABLE ExitKeysCleanIPONoExit AS
 +
SELECT A.*, B.targetname, B.targetstate, B.announceddate
 +
FROM IPOCleanNoDups AS A LEFT JOIN MACleanNoDups AS B ON A.coname=B.Coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.statecode IS NULL AND
 +
B.coname IS NULL AND B.datefirstinv IS NULL;
 +
--2049
 +
 +
DROP TABLE ExitKeysCleanMANoExit;
 +
CREATE TABLE ExitKeysCleanMANoExit AS
 +
SELECT A.*, B.issuername, B.issuerstate, B.issuedate
 +
FROM MACleanNoDups AS A LEFT JOIN IPOCleanNoDups AS B ON A.coname=B.Coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.statecode IS NULL AND
 +
B.coname IS NULL AND B.datefirstinv IS NULL;
 +
--7096
 +
 +
Since 2141-92=2019 and 7188-92=1096, we know that the duplicate companies were extracted successfully.
 +
 +
I then created a column that identifies whether a company underwent an MA or an IPO. A 0 indicates an MA and a 1 indicates an IPO.
 +
DROP TABLE ExitKeysCleanMA;
 +
CREATE TABLE ExitKeysCleanMA AS
 +
SELECT *,
 +
CASE
 +
  WHEN issuername IS NULL AND issuerstate IS NULL AND issuedate IS NULL THEN 0 ELSE NULL END AS MAvsIPO
 +
FROM exitkeyscleanmanoexit;
 +
--7096
 +
 +
DROP TABLE ExitKeysCleanIPO;
 +
CREATE TABLE ExitKeysCleanIPO AS
 +
SELECT *,
 +
CASE
 +
  WHEN targetname IS NULL AND targetstate IS NULL AND announceddate IS NULL THEN 1 ELSE NULL END AS MAvsIPO
 +
FROM exitkeyscleaniponoexit as A;
 +
--2049
  
 
I then wrote a query to check which date was lower and used that to indicate whether I chose the company to have undergone an MA or an IPO in the column MSvsIPO(I chose based on which process came first). A 0 in the column represented an MA being chosen and a 1 represented an IPO being chosen.
 
I then wrote a query to check which date was lower and used that to indicate whether I chose the company to have undergone an MA or an IPO in the column MSvsIPO(I chose based on which process came first). A 0 in the column represented an MA being chosen and a 1 represented an IPO being chosen.
Line 395: Line 457:
 
  FROM IPOMAForReview;
 
  FROM IPOMAForReview;
 
  --92
 
  --92
 +
 +
 +
Then out of this table I extracted the MAs and IPOs using the the created MAvsIPO flag:
 +
 +
DROP TABLE MASelected;
 +
CREATE TABLE MASelected AS
 +
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo
 +
FROM IPOMASelected AS B LEFT JOIN companybasecore AS A ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE mavsipo=0;
 +
--25
 +
DROP TABLE IPOSelected;
 +
CREATE TABLE IPOSelected AS
 +
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo
 +
FROM IPOMASelected AS B LEFT JOIN companybasecore AS A ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE mavsipo=1;
 +
--67
 +
 +
I then put together all of the IPOs that I selected into one table and all of the MAs I selected into another table. I did this using UNION statements. I did this because I didn't want to duplicate any IPOs and UNION acts as a SELECT DISTINCT statement. Thus if the number of the two tables added together equaled the final count of the table, I did not have any duplicate rows.
 +
DROP TABLE SelectedIPOSAll;
 +
CREATE TABLE SelectedIPOSALL AS
 +
SELECT A.coname, A.statecode, A.datefirstinv, A.targetname, A.targetstate, A.announceddate, A.issuername, A.issuerstate, A.issuedate, A.mavsipo FROM
 +
ExitKeysCleanIPO AS A
 +
UNION
 +
SELECT IPOSelected.* FROM
 +
IPOSelected;
 +
--2116
 +
--Makes sense because 2049+67=2116
 +
 +
DROP TABLE SelectedMASAll;
 +
CREATE TABLE SelectedMASALL AS
 +
SELECT A.coname, A.statecode, A.datefirstinv, A.targetname, A.targetstate, A.announceddate, A.issuername, A.issuerstate, A.issuedate, A.mavsipo FROM
 +
ExitKeysCleanMA AS A
 +
UNION
 +
SELECT MASelected.* FROM
 +
MASelected;
 +
--7121
 +
--Makes sense because 25+7096=7121
 +
 +
I then checked both of these files to make sure that their primary keys were still distinct and thus valid.
 +
SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM SelectedMasAll)a;
 +
--7121
 +
 +
SELECT COUNT(*) FROM(SELECT targetname, targetstate, announceddate FROM SelectedMasAll)a;
 +
--7121
 +
 +
SELECT COUNT(*) FROM(SELECT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;
 +
--2116
 +
 +
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;
 +
--2116
 +
 +
I combined the two tables and checked the result to make sure it Unioned correctly.
 +
 +
DROP TABLE SelectedIPOMA;
 +
CREATE TABLE SelectedIPOMA AS
 +
SELECT A.*
 +
FROM SelectedMASALL AS A
 +
UNION
 +
SELECT B.*
 +
FROM SelectedIPOSALL AS B;
 +
--9237
 +
 +
SELECT COUNT(*) FROM SelectedIPOMA WHERE mavsipo=0;
 +
--7121
 +
 +
SELECT COUNT(*) FROM SelectedIPOMA WHERE mavsipo=1;
 +
--2116
 +
 +
SELECT COUNT(*) FROM SelectedIPOMA WHERE targetname IS NOT NULL;
 +
--7188
 +
 +
SELECT COUNT(*) FROM SelectedIPOMA WHERE issuername IS NOT NULL;
 +
--2141
 +
 +
Everything seems to check out, so we can move on to joining the SelectedIPOMA table to the companybasecore table to create the ExitKeysClean table.
 +
DROP TABLE ExitKeysClean;
 +
CREATE TABLE ExitKeysClean AS
 +
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo
 +
FROM companybasecore AS A
 +
LEFT JOIN SelectedIPOMA AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
 +
--48000
 +
 +
Since 48000 is the same number of rows in the companybasecore, we know that the join was successful and no rows were added that we don't want.
 +
 +
==Create the PortCoExit Table==
 +
From consulting with Ed and the VC Database Rebuild wiki, I decided to make the PortCoExit table with an mavsipo, an exitdate, an exited, and an exitvalue column. DROP TABLE PortCoExit;
 +
CREATE TABLE PortCoExit AS
 +
SELECT A.coname, A.statecode, A.datefirstinv,
 +
CASE WHEN a.mavsipo=0 THEN 0::int WHEN a.mavsipo=1 THEN 1::int ELSE NULL::int END AS mavsipo,
 +
CASE WHEN a.mavsipo=0 THEN B.announceddate WHEN a.mavsipo=1 THEN C.issuedate ELSE NULL::date END AS exitdate,
 +
CASE WHEN a.mavsipo=0 OR a.mavsipo=1 THEN 1::int ELSE 0::int END AS exited,
 +
CASE WHEN a.mavsipo=0 THEN B.transactionamt WHEN a.mavsipo=1 THEN C.proceedsamt ELSE NULL::money END AS exitvalue
 +
  FROM ExitKeysClean AS A
 +
LEFT JOIN MANoDups AS B ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
 +
LEFT JOIN IPONoDups AS C ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=C.issuedate;
 +
--48000
 +
 +
The issue that I am currently having is that the exitvalue column for MAs is always Null, and there are some weird numbers in the table. I will continue to look for these. I have been continuously checking for duplicates and validity of the primary keys, so I do not believe these problems to be due to duplicates

Revision as of 12:54, 23 July 2018


McNair Project
VentureXpert Data
Project logo 02.png
Project Information
Project Title VentureXpert Data
Owner Augi Liebster
Start Date June 20, 2018
Deadline
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.


Augi Liebster (Work Log)

Relevant Former Projects

  1. Venture Capital (Data)
  2. Retrieving US VC Data From SDC
  3. VC Database Rebuild

Location

My scripts for SDC pulls are located in the Z drive in the location:

E:\VentureXpert Database\ScriptsForSDCExtract

My successfully pulled and normalized files are stored in the location:

E:\VentureXpert Database\ExtractedDataQ2

My script for loading data is in one big text file in the location:

E:\VentureXpert Database\vcdb3\LoadingScripts

The folder vcdb2 is there for reference to see what people before had done. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.


Goal

I will be looking to redesign the VentureXpert Database in a way that is more intuitively built than the previous one. I will also update the database with current data.

Initial Stages

The first step of the project was to figure out what primary keys to use for each major table that I create. I looked at the primary keys used in the creation of the VC Database Rebuild and found primary keys that are decent. I have updated them and list them below:

  1. CompanyBaseCore- coname, statecode, datefirstinv
  2. IPOCore- issuer, issuedate, statecode
  3. MACore- target name, target state code, announceddate
  4. Geo - city, statecode, coname, datefirst, year
  5. DeadDate - conname, statecode, datefirst, rounddate (tentative could still change)
  6. RoundCore- conname, statecode, datefirst, rounddate
  7. FirmBaseCore - firmname
  8. FundBaseCore - fund name (firstinvedate doesn't work because not every row has an entry)

These are my initial listings and I will come back to update them if needed.

The second part of the initial stage has been to pull data from the SDC Platinum platform. I did it in July to ensure that I had two full quarters of data.


SDC Pull

When pulling data from SDC, it is a good idea to look for previously made rpt files that have the names of the pulls you will need to do. They have already been created and will save you a lot of work. The rpt files that I used are in the folder VentureXpertDB/ScriptsForSDCExtract. The files will come in pairs with one being saved as an ssh file and one as a rpt file. To update the dates to make them recent, go into the ssh file of the pair and change the date of last investment. When you open SDC, you will be given a variety of choices for which database to pull from. For each type of file chose the following:

  1. VentureXpert - PortCo, PortCoLong, USVC, Firms, BranchOffices, Funds, Rounds, VCFirmLong
  2. Mergres & Acquisition - MAs
  3. Global New Issues Databases - IPOs

Help on pulling data from SDC is on the SDC Platinum (Wiki) page.

VCFund Pull Problem

When pulling the VCFund1980-Present, I encountered two problems. One, is that SDC is not able to sort through the funds that are US only with the built in filters. Two, there are multiple rpt files that specify different variables for the fund pull. I pulled from both to be safe, but in the VC Database Rebuild page there is a section on the fund pull where Ed specifies which rpt file he used to pull data from SDC. Regardless I have both saved in the ExtractedData folder. After speaking with Ed, he told me to use the VCFund1980-present.rpt file to extract the data. Had various problems extracting data including freezing of SDC program or getting error Out of Memory. Check the SDC Platinum (Wiki) page to fix these issues.


Loading Tables

When I describe errors I encountered, I will not describe them using line numbers. This is because as soon as any data is added, the line numbers will become useless. Instead I recommend that you copy the normalized file you are working with into an excel file and using the filter feature. This way you can find the line number in your specific file that is causing errors and fix it in the file itself. The line numbers that PuTTY errors display are often wrong, so I relied on excel to discover the error fastest. If my instructions are not enough for you to find the error, my advice would be to find key words in the line that PuTTY is telling you is causing errors and filter through excel.


DROP TABLE roundbase;
CREATE TABLE roundbase (
  coname varchar(255),
  rounddate date,
  updateddate date,
  foundingdate date,
  datelastinv date,
  datefirstinv date,
  investedk real,
  city varchar(100),
  description varchar(5000),
  msa varchar(100),
  msacode varchar(10),
  nationcode varchar(10),
  statecode varchar(10),
  addr1 varchar(100),
  addr2 varchar(100),
  indclass varchar(100),
  indsubgroup3 varchar(100),
  indminor varchar(100),
  url varchar(5000),
  zip varchar(10),
  stage1 varchar(100),
  stage3 varchar(100),
  rndamtdisck real,
  rndamtestk real,
  roundnum integer,
  numinvestors integer
);
\COPY roundbase FROM 'USVC1980-2018q2-Good.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--151549

The only error I encountered here was with Cardtronic Technology Inc. Here there was a problem with a mixture of quotation marks which cause errors in loading. Find this using the excel trick and remove it manually.

DROP TABLE ipos;
CREATE TABLE ipos (
  issuedate date,
  issuer varchar(255),
  statecode varchar(10), 
  principalamt money, --million
  proceedsamt money, --sum of all markets in million
  naiccode varchar(255), --primary NAIC code
  zipcode varchar(10),
  status varchar (20),
  foundeddate date
);
\COPY ipos FROM 'IPO1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--12107

I encountered no errors while loading this data.


DROP TABLE branchoffices;
CREATE TABLE branchoffices (
  firmname varchar(255),
  bocity varchar(100),
  bostate varchar(2),
  bocountrycode varchar(2),
  bonation varchar(100),
  bozip varchar(10),
  boaddr1 varchar(100),
  boaddr2 varchar(100)

);
\COPY branchoffices FROM 'USVCFirmBranchOffices1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--10353

I encountered no errors while loading this data.

DROP TABLE roundline;
CREATE TABLE roundline (
  coname varchar(255),
  statecode varchar(2),
  datelastinv date,
  datefirstinv date,
  rounddate date,
  disclosedamt money,
  investor varchar(255)
);
\COPY roundline FROM 'USVCRound1980-2018q2-NoFoot-normal-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--403189

I encountered no errors while loading this data.

DROP TABLE fundbase;
CREATE TABLE fundbase (
  fundname varchar(255),
  closedate date, --mm-dd-yyyy
  lastinvdate date, --mm-dd-yyyy
  firstinvdate date, --mm-dd-yyyy
  numportcos integer,
  investedk real,
  city varchar(100),
  fundyear varchar(4), --yyyy
  zip varchar(10),
  statecode varchar(2),
  fundsizem real,
  fundstage varchar(100),
  firmname varchar(255),
  dateinfoupdate date,
  invtype varchar(100),
  msacode varchar(10),
  nationcode varchar(10),
  raisestatus varchar(100),
  seqnum integer,
  targetsizefund real,
  fundtype varchar(100)
);
\COPY fundbase FROM 'VCFund1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--29397

There is a Ukranian fund that has stray quotation marks in its name. It is called something along the lines of "VAT "ZNVKIF "Skhidno-Evropeis'lyi investytsiynyi Fond". If this does not help, you can filter in excel using Kiev as the keyword in the city column and find the line where you are getting errors. Then manually remove the commas in the actual text file. After that, the table should load correctly.

DROP TABLE firmbase;
CREATE TABLE firmbase(
  firmname varchar(255),
  foundingdate date, --mm-dd-yyyy
  datefirstinv date, --mm-dd-yyyy  
  datelastinv date, --mm-dd-yyyy
  addr1 varchar(100),
  addr2 varchar(100),
  location varchar(100),
  city varchar(100),
  zip varchar(10),
  areacode integer,
  county varchar(100),
  state varchar(2),
  nationcode varchar(10),
  nation varchar(100),
  worldregion varchar(100),
  numportcos integer,
  numrounds integer,
  investedk money,
  capitalundermgmt money,  
  invstatus varchar(100),
  msacode varchar(10),
  rolepref varchar(100),
  geogpref varchar(100),
  indpref varchar(100),
  stagepref varchar(100),
  type varchar(100)
);


\COPY firmbase FROM 'USVCFirms1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--15899

The normalization for this file was wrong when I tried to load the data. To fix this go to the file where you have removed the footer and find the column header titled Firm Capital under Mgmt{0Mil}. Delete the {0mil} and renormalize the file. Then everything should be ok. A good way to check this is to copy and paste the normalized file into an excel sheet and see whether the entries line up with their column header correctly. The second error I found was with the Kerala Ventures firm. Here the address has the word l"opera in it. This quotation will cause errors so find the line number using excel and remove it manually. The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually. These were the only errors I encountered while loading this table.


DROP TABLE mas;
CREATE TABLE mas (
  announceddate date,
  effectivedate date,
  targetname varchar(255),
  targetstate varchar(100),
  acquirorname varchar(255),
  acquirorstate varchar(100),
  transactionamt money,
  enterpriseval varchar(255),
  acquirorstatus varchar(150)
);
\COPY mas FROM 'MAUSTargetComp100pc1985-July2018-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--119432


Instructions on Matching PortCos to Issuers and M&As From Ed

Get portco keys

DROP TABLE portcokeys;
CREATE TABLE portcokey AS
SELECT coname, statecode, datefirst
FROM portcocore;
--CHECK COUNT IS SAME AS portcocore OR THESE KEYS ARE VALID AND FIX THAT FIRST

Get distinct coname and put it in a file

\COPY (SELECT DISTINCT coname FROM portcokeys) TO 'DistinctConame.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

Match that to itself

Move DistinctConame.txt to E:\McNair\Software\Scripts\Matcher\Input
Open powershell and change directory to E:\McNair\Software\Scripts\Matcher
Run the matcher in mode2:
 perl Matcher.pl -file1="DistinctConame.txt" -file2="DistinctConame.txt" -mode=2
Pick up the output file from E:\McNair\Software\Scripts\Matcher\Output (it is probably called DistinctConame.txt-DistinctConame.txt.matched) and move it to your Z drive directory

Load the matches into the dbase

DROP TABLE PortcoStd;
CREATE TABLE PortcoStd (
   conamestd  varchar(255),
   coname   varchar(255),
   norm  varchar(100),
   x1  varchar(255),
   x2  varchar(255)
);

\COPY CohortCoStd FROM 'DistinctConame.txt-DistinctConame.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--YOUR COUNT

Join the Conamestd back to the portcokeys table to create your matching table

DROP TABLE portcokeysstd;
CREATE TABLE portcokeysstd AS
SELECT B.conamestd, A.*
FROM portcokey AS A
JOIN PortcoStd AS B ON A.coname=B.coname
--CHECK COUNT IS SAME AS portcokey OR YOU LOST SOME NAMES OR INFLATED THE DATA

Put that in a file for matching (conamestd is in first column by construction)

 \COPY portcokeysstd TO 'PortCoMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
 --YOUR COUNT


Now prepare to repeat that process for M&A's and IPOs:

  • For M&As your keys (for now) will be targetname, statecode, dateannounced
  • For IPOs your keys (for now) will be issuername, statecode, issuedate
  • FIRST CLEAN EACH DATASET. The easiest way to remove duplicates (if you have lots of them) is to use an aggregate query:
DROP TABLE IPOCoreNoDups;
CREATE TABLE IPOCoreNoDups as
SELECT issuername, statecode, issuedate, max(var1) as var1, avg(var2) as var2, ...
FROM IPOCore GROUP BY issuername, statecode, issuedate ORDER BY issuername, statecode, issuedate;

Note that you need all vars to be inside aggregates and that you should choose the aggregate function sensibly by looking at the data. Generally use MAX for amounts and MIN for dates. You can also use MAX or MIN on text strings.

And now build the same stacks as before but to create Issuerkeystd and TargetKeystd (or whatever you call them). Make sure that issuerstd (and targetnamestd) is in the first column.

Now match Portcokeystd to Issuerkeystd, and match Portcokeystd to Targetkeystd

  • Move the files into the input director as before
  • Run the matcher script but WITHOUT mode 2:
perl Matcher.pl -file1="PortCoMatchInput.txt" -file2="IssuerMatchInput.txt"
perl Matcher.pl -file1="PortCoMatchInput.txt" -file2="TargetMatchInput.txt"

Open each of these files in excel and mark good matches with 1s and bad matches with 0s by adding columns to compare dates, states, etc, and filtering.

When you are done:

  • Build a new sheet of just good matches.
  • Save the excel files
  • Copy each of your match sheets to a text file
  • CREATE TABLE to reflect the data you are going to load (include std names and keys)
  • \COPY the data (using the exact copy command above but changing the table and file names) into the table
  • Celebrate!
  • Next we'll deal with any firms that have an IPO and an M&A and decide which we'll keep
  • And then we'll join in the chosen IPO and M&A data and move on!

Cleaning IPO and MA Data

It is important to follow Ed's direction of cleaning the data using aggregate function before putting the data into excel. This will keep you from a lot of manual checking that is unnecessary. When ready, paste the data you have into an excel file. In that excel file, I made three columns: one to check whether state codes were equivalent, one checking whether the date of first investment was 3 years before the MA or IPO, and one checking whether both of these conditions were satisfied for each company. I did this using simple if statements. This process is manual checking and filtering to see whether matches are correct or not and are thus extremely subjective and tedious. First, I went through and checked the companies that did not have equivalent state codes. If the company was one that I knew or the name was unique to the point that I did not believe the same name would appear in another state, I marked the state codes as equivalent. I did the same for the date of first investment vs MA/IPO date. Then I removed all duplicates that had the marking Warning Multiple Matches, and the data sheets were clean.


Finding Companies that Underwent IPOs and MAs

  1. Load IPOClean and MAClean into the database.
  2. Perform an INNER JOIN on the two tables in order to find the companies that underwent both MAs and IPOs. Do this by joining on the primary keys of portcos in both tables. Load this table into an excel sheet and manually find which companies you want to keep as MAs and which you want to keep as IPOs. Make sure to keep the portco primary key in this table.
  3. Load the decided IPO and MA data back into the database, including the primary keys of the portcos.
  4. LEFT JOIN the MA table with the IPO table. Join on primary keys of portcos. Select the companies where the IPO table are null as these are the companies that only had MAs. Do the same for IPOs. Now you have tables of companies that underwent only MAs and only IPOs.
  5. Join the companies that underwent IPOs only and the chosen IPOs back to the original key using the primary key of the company which must be in both tables. Repeat this for the MA table.

Ed's version

Note: your version isn't wrong. I'm just trying to be clearer.

Finding Companies that Underwent IPOs and MAs

  1. Load IPOClean and MAClean into the database.
  2. Make a table of keys by doing a JOIN on the two tables in order to find the companies that underwent both MAs and IPOs. (Do this by joining on the primary keys of portcos in both tables.) Make another table by left joining these results back with IPOClean and MAClean. COPY this table into an excel sheet and manually find which companies you want to keep as MAs and which you want to keep as IPOs. Make sure to keep the portco primary key in this table.
  3. Load the decided IPO and MA data back into the database into a table called IPOMASelected, including the primary keys of the portcos, and the primary keys of the MAs and IPOs selected.
  4. Make a table of primary keys of portcos and MAs and IPOs called ExitKeysClean. Do this by taking the IPOs which didn't have MAs from IPOClean and the MAs which didn't have IPOs from MAClean and the selected MAs and IPOs from IPOMASelected. Note that one way to do this is to LEFT JOIN the MA table with the IPO table. Select the companies where the IPO table are null as these are the companies that only had MAs. Do the same for IPOs. Now you have tables of companies that underwent only MAs and only IPOs.
  5. Make a table called PortCoExits by compiling information from ExitKeysClean Left joined to your main (no duplicates) IPO and MA tables, and using CASE WHEN statements.

Example Join on a composite Key:

DROP TABLE IPOMAForReview;
CREATE TABLE IPOMAForReview
SELECT A.*, B.targetname, B.targetstate, B.announcedate
FROM IPOClean AS A
JOIN MAClean AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;

Example CASE WHEN statement:

DROP TABLE PortCoExits;
CREATE TABLE PortCoExits 
SELECT A.coname, A.statecode, A.datefirst, 
CASE WHEN b.coname IS NOT NULL THEN 1::int WHEN c.coname IS NOT NULL THEN 0::int ELSE NULL::int END AS IPOvsMA,
CASE WHEN b.coname IS NOT NULL THEN b.proceedsall WHEN c.coname IS NOT NULL THEN c.transactionvalue ELSE NULL::int END AS exitvalue,
...
FROM ExitKeysClean AS A
LEFT JOIN IPONoDups AS B ON A.issuername=B.issuername AND A.statecode=B.statecode AND A.issuedate=B.issuedate
LEFT JOIN MANoDups AS B ON A.targetname=B.targetname AND A.statecode=B.statecode AND A.announcedate=B.announcedate;

When putting values together, make sure that they are in the same units (pref $m).

Process For Creating the PortCoExits Table

Even if you manually checked the excel sheet for Warning Multiple Matches with the Hall warning, there still may be duplicates. Check this by running:

SELECT COUNT(*) FROM(SELECT targetname, targetstate, announceddate FROM MAClean)a;
--7206
SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM MAClean)a;
--7189
SELECT COUNT(*) FROM(SELECT issuername, issuerstate, issuedate FROM IPOClean)a;
--2146
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM IPOClean)a;
--2141


As you can see, I still have duplicates in both the MAClean and IPOClean files. I ran an aggregate function to get rid of these duplicates:

DROP TABLE MACleanNoDups;
CREATE TABLE MACleanNoDups AS
SELECT targetname, targetstate, announceddate, MIN(x1) as x1, MIN(x2) as x2, MIN(method) as method, MIN(conamestd) as conamestd, MIN(coname) as coname, MIN(statecode) as 
statecode, MIN(datefirstinv) as datefirstinv, MIN(targetnamestd) as targetnamestd
FROM MAClean
GROUP BY targetname, targetstate, announceddate;
--7189

There are two companies that have the name Masspower in the MACleanNoDups file. One is written in all caps and will thus not be caught by the aggregate function. You will have to find it manually and delete it in order to ensure that your joining of MAs to companybasecore will not add in extra rows.

DELETE FROM MaCleanNoDups WHERE coname='Masspower' AND announceddate='2006-03-15';
--7188

Now do the same for the IPOs.

DROP TABLE IPOCleanNoDups;
CREATE TABLE IPOCleanNoDups AS
SELECT issuername, issuerstate, issuedate, MIN(x1) as x1, MIN(x2) as x2, MIN(method) as method, MIN(conamestd) as conamestd, MIN(coname) as coname, MIN(statecode) as statecode, 
MIN(datefirstinv) as datefirstinv, MIN(issuernamestd) as issuernamestd
FROM IPOClean
GROUP BY issuername, issuerstate, issuedate;
--2141

Now the duplicates are out of the MAClean and IPOClean data and we can start to construct the ExitKeysClean table.

Creating ExitKeysClean=

First I looked for the PortCos that were in both the MAs and the IPOs. I did this using:

DROP TABLE IPOMAForReview;
CREATE TABLE IPOMAForReview
SELECT A.*, B.targetname, B.targetstate, B.announcedate
FROM IPOClean AS A
JOIN MAClean AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;

--92

I then pulled out the IPOs that were only IPOs and MAs that were only MAs.

DROP TABLE ExitKeysCleanIPONoExit;
CREATE TABLE ExitKeysCleanIPONoExit AS
SELECT A.*, B.targetname, B.targetstate, B.announceddate
FROM IPOCleanNoDups AS A LEFT JOIN MACleanNoDups AS B ON A.coname=B.Coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.statecode IS NULL AND 
B.coname IS NULL AND B.datefirstinv IS NULL;
--2049
DROP TABLE ExitKeysCleanMANoExit;
CREATE TABLE ExitKeysCleanMANoExit AS
SELECT A.*, B.issuername, B.issuerstate, B.issuedate
FROM MACleanNoDups AS A LEFT JOIN IPOCleanNoDups AS B ON A.coname=B.Coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.statecode IS NULL AND 
B.coname IS NULL AND B.datefirstinv IS NULL;
--7096

Since 2141-92=2019 and 7188-92=1096, we know that the duplicate companies were extracted successfully.

I then created a column that identifies whether a company underwent an MA or an IPO. A 0 indicates an MA and a 1 indicates an IPO.

DROP TABLE ExitKeysCleanMA;
CREATE TABLE ExitKeysCleanMA AS
SELECT *,
CASE 
 WHEN issuername IS NULL AND issuerstate IS NULL AND issuedate IS NULL THEN 0 ELSE NULL END AS MAvsIPO
FROM exitkeyscleanmanoexit;
--7096
DROP TABLE ExitKeysCleanIPO;
CREATE TABLE ExitKeysCleanIPO AS
SELECT *,
CASE 
 WHEN targetname IS NULL AND targetstate IS NULL AND announceddate IS NULL THEN 1 ELSE NULL END AS MAvsIPO
FROM exitkeyscleaniponoexit as A;
--2049

I then wrote a query to check which date was lower and used that to indicate whether I chose the company to have undergone an MA or an IPO in the column MSvsIPO(I chose based on which process came first). A 0 in the column represented an MA being chosen and a 1 represented an IPO being chosen.

DROP TABLE IPOMASelected; 
CREATE TABLE IPOMASelected AS 
SELECT *,
CASE 
 WHEN issuedate < announceddate THEN 1 ELSE 0 END AS MAvsIPO
FROM IPOMAForReview;
--92


Then out of this table I extracted the MAs and IPOs using the the created MAvsIPO flag:

DROP TABLE MASelected;
CREATE TABLE MASelected AS
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo
FROM IPOMASelected AS B LEFT JOIN companybasecore AS A ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE mavsipo=0;
--25
DROP TABLE IPOSelected;
CREATE TABLE IPOSelected AS
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo
FROM IPOMASelected AS B LEFT JOIN companybasecore AS A ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE mavsipo=1;
--67

I then put together all of the IPOs that I selected into one table and all of the MAs I selected into another table. I did this using UNION statements. I did this because I didn't want to duplicate any IPOs and UNION acts as a SELECT DISTINCT statement. Thus if the number of the two tables added together equaled the final count of the table, I did not have any duplicate rows.

DROP TABLE SelectedIPOSAll;
CREATE TABLE SelectedIPOSALL AS
SELECT A.coname, A.statecode, A.datefirstinv, A.targetname, A.targetstate, A.announceddate, A.issuername, A.issuerstate, A.issuedate, A.mavsipo FROM
ExitKeysCleanIPO AS A
UNION
SELECT IPOSelected.* FROM
IPOSelected;
--2116
--Makes sense because 2049+67=2116
DROP TABLE SelectedMASAll;
CREATE TABLE SelectedMASALL AS
SELECT A.coname, A.statecode, A.datefirstinv, A.targetname, A.targetstate, A.announceddate, A.issuername, A.issuerstate, A.issuedate, A.mavsipo FROM
ExitKeysCleanMA AS A
UNION
SELECT MASelected.* FROM
MASelected;
--7121
--Makes sense because 25+7096=7121

I then checked both of these files to make sure that their primary keys were still distinct and thus valid.

SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM SelectedMasAll)a;
--7121
SELECT COUNT(*) FROM(SELECT targetname, targetstate, announceddate FROM SelectedMasAll)a;
--7121
SELECT COUNT(*) FROM(SELECT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;
--2116
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;
--2116

I combined the two tables and checked the result to make sure it Unioned correctly.

DROP TABLE SelectedIPOMA;
CREATE TABLE SelectedIPOMA AS
SELECT A.*
FROM SelectedMASALL AS A
UNION
SELECT B.*
FROM SelectedIPOSALL AS B;
--9237
SELECT COUNT(*) FROM SelectedIPOMA WHERE mavsipo=0;
--7121
SELECT COUNT(*) FROM SelectedIPOMA WHERE mavsipo=1;
--2116
SELECT COUNT(*) FROM SelectedIPOMA WHERE targetname IS NOT NULL;
--7188

SELECT COUNT(*) FROM SelectedIPOMA WHERE issuername IS NOT NULL;
--2141

Everything seems to check out, so we can move on to joining the SelectedIPOMA table to the companybasecore table to create the ExitKeysClean table.

DROP TABLE ExitKeysClean;
CREATE TABLE ExitKeysClean AS
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo
FROM companybasecore AS A 
LEFT JOIN SelectedIPOMA AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
--48000

Since 48000 is the same number of rows in the companybasecore, we know that the join was successful and no rows were added that we don't want.

Create the PortCoExit Table

From consulting with Ed and the VC Database Rebuild wiki, I decided to make the PortCoExit table with an mavsipo, an exitdate, an exited, and an exitvalue column. DROP TABLE PortCoExit;

CREATE TABLE PortCoExit AS
SELECT A.coname, A.statecode, A.datefirstinv,
CASE WHEN a.mavsipo=0 THEN 0::int WHEN a.mavsipo=1 THEN 1::int ELSE NULL::int END AS mavsipo,
CASE WHEN a.mavsipo=0 THEN B.announceddate WHEN a.mavsipo=1 THEN C.issuedate ELSE NULL::date END AS exitdate,
CASE WHEN a.mavsipo=0 OR a.mavsipo=1 THEN 1::int ELSE 0::int END AS exited,
CASE WHEN a.mavsipo=0 THEN B.transactionamt WHEN a.mavsipo=1 THEN C.proceedsamt ELSE NULL::money END AS exitvalue
 FROM ExitKeysClean AS A
LEFT JOIN MANoDups AS B ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
LEFT JOIN IPONoDups AS C ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=C.issuedate;
--48000

The issue that I am currently having is that the exitvalue column for MAs is always Null, and there are some weird numbers in the table. I will continue to look for these. I have been continuously checking for duplicates and validity of the primary keys, so I do not believe these problems to be due to duplicates