Difference between revisions of "Venture Capital (Data)"

From edegan.com
Jump to navigation Jump to search
 
(51 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{McNair Projects
+
{{Project
 +
|Has project output=Data,Tool
 +
|Has sponsor=McNair Center
 
|Has title=Venture Capital (Data)
 
|Has title=Venture Capital (Data)
|Has owner=Jake Silberman
+
|Has owner=Adrian Smart, Jake Silberman, Meghana Gaur,
|Has notes=
 
|Is dependent on=
 
|Depends upon it=
 
|Has project status=Active
 
 
|Has keywords=Data
 
|Has keywords=Data
 +
|Has project status=Subsume
 
}}
 
}}
 +
==NOTICE==
  
==Arching Objective==
+
This project page is largely redundant and needs to be cleaned up. The main project pages for this project are:
To create a master venture capital data set on which all other projects requiring venture capital data will be built.
+
[[Retrieving US VC Data From SDC]], and [[VC Database Rebuild]].
  
==Variables to Pull==
 
===Company Level===
 
*coname - Company Name
 
*state - Company State
 
*Industry - Company VE Industry Class
 
*address - Company Area Cd or Company North American Location
 
*date of founding - Company Founding Date
 
*date of first investment - Date Company Rcvd its First Investment
 
*date of last investment - Date Company Rcvd its Last Investment
 
*total amount invested - Total Known Amount Invested in Company
 
*MSA, CMSA - Company MSA
 
*city - Company City
 
*date of IPO - Company IPO Date
 
  
===round level===
+
==Project Objective==
====main round====
+
To create a master venture capital data set on which all other projects requiring venture capital data will be built.  
*round date - Round Date
 
*round amount - Round Amount, Disclosed ($ Thou) or Round Amount, Estimated ($ Thou)
 
*coname (or primary key) - Company Name
 
*round number - Round Number
 
*stage - STAGE1, STAGE2, STAGE3
 
====round on one line====
 
*round on one line - COMBINED_ROUND_INFO
 
**round date - Fund Portfolio Cos: Round Date
 
**round amount - Fund Portfolio Cos: Round Amount, Disclosed ($ Thou) or Fund Portfolio Cos: Round Amount, Estimated (Thou)
 
**name of investor - Fund Name
 
*fund type - Fund Type
 
*coname (or primary key) - Fund Portfolio Cos: Name
 
*E:\McNair\Software\Scripts contains normalizing script for round on one line. HubsCapture.pl
 
  
===Fund Level?===
+
Specifically, we will be building data to support the following:
*fund name
+
*Lists and counts by year as well as state, MSA, and city, zip code, etc.
*closing date
+
*Identify seed, early and later stage deals (first investment), dollars invested and rounds
*last investment
+
*Identify transactional VC dollars invested and rounds
*first investment
+
*Acquisitions and IPOs of VC backed firms
*msa
+
*Identify when VC backed firms are "alive"
*msa code
+
*Geocode all portfolio companies
*average investment
+
*Geocode VC headquarters and branch offices
*number of companies invested
+
*Compute distances between investors and their portfolio companies
*total known investment
+
*Identify lead investors for portfolio companies
*address
+
*Identify top 100 VCs
*fund size
 
  
==Data sets to build==  
+
==Retrieving data==
#number of deals (first investments) by state, industry, msa, city, year, quarter
 
#number of foundings (conditional on receiving vc later) by state, industry, msa, city, year, quarter
 
#deal amount (first investment)  by stage, state, industry, msa, city, year, quarter
 
#invested amount (any round) by stage, state, industry, msa, city, year, quarter
 
#number of deals by stage state, industry, msa, city, year, quarter
 
#number of investments (any round) by stage, state, industry, msa, city, year, quarter
 
#number of investors by stage, state, industry, msa, city, year, quarter
 
  
==Houston/Galveston/Brazoria Pull/Work==
+
Old notes are [[Old Venture Capital Data Work]]
*5 Pulls, all under MSA Houston/Galveston/Brazoria filter
 
**Firm Level
 
**Fund Level
 
**Company Level
 
**Round Level
 
**Round on One Line
 
  
===Work Details===
+
{{:Retrieving US VC Data From SDC}}
*SDC pulls at all 5 levels followed by normalization of data
 
*Data can be found in E:\McNair\Projects\Houston\MarketNonMarket (more details about which are specific to this project will be posted later) and files for import to SQL can be found in Z:\VentureCapitalData
 
*SQL tables are found in psql vcdata
 
*SQL code found in E:\McNair\Projects\Houston\MarketNonMarket\SQL Script 1
 
  
==Jake Project==
+
Additional datafiles (in E:\McNair\Projects\VC Database):
*1
+
*GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.
Rank top 20 cities in the USA for venture capital. Rank by:
 
# # of deals (1st investments - deal number = 1 or rounddate=datefirstinv)
 
# amount invested per year (sum of round dollars)
 
# number of rounds made per year (count of rounds)
 
# number of IPOs per year (VC backed)
 
# number of acquisitions per year (VC backed)
 
# number of alive venture backed firms in a year (had a round within last 5 years and haven't exited)
 
*Alive - Examine health of entrepreneurship ecosystem within an MSA or CMSA and how it is affected by the proximity of venture-backed companies to one another (measured by minimum circle area between firms within the MSA). The goal is to see whether MSA's with closer or further proximity between companies see greater growth in successive years. In order to be considered "living," the companies must have not undergone IPO's, liquidations, or acquisitions and must have received venture funding within the last 5 years. If these requirements are not met, they are dropped from the data.
 
  
Throw out duplicate names for companies (take only the most recent one when duplicates and match on name and first inv).
+
==Processing the base tables==
  
===Work Details===
+
The SQL scripts need to do the following:
*See SQL database 'vcdata' for tables
+
*Extract company infomation from roundbase then process it
*SDC data imported into 'vcdata' found in Z:\VentureCapitalData
+
**Add variable for growth (1), transactional (2), or non-VC (3) (see below)
*Code found in E:\McNair\Projects\VentureCapitalData\SQL Scripts
+
**Create clean 5-digit ZIP field
 +
*Extract firm information from firmbase, then:
 +
**Add indicator variable for US
 +
**Create clean 5-digit ZIP field
 +
*Build out BranchOffice (BO) table
 +
*Build lookups for:
 +
**State Code
 +
**raisestatus
  
==SDC Pull==
+
SQL file to process data is:
 +
Z:/VentureCapitalData/SDCVC/ProcessData.sql
  
===Restrictions===
+
==Lookup Tables==
*Nation: U.S. State: TX MSA: Houston, Texas City-Galveston, Brazoria Year: 1980-2017. Venture Related: Yes
 
*Definition of Houston: MSA Codes for Galveston, Texas City, Brazoria, and Houston, and at state=TX
 
  
===Variables Pulled===
+
===Stage of Investment===
Pull round based info (through portfolio company perspective):
 
*Company Name
 
*Round Number
 
*Round Date
 
*Round Amount Disclosed (thou)
 
*Date Company Received Last Investment
 
*Date Company Received First Investment
 
*Company City
 
*Total Known Amount Invested in Company (thou)
 
*Company IPO Date
 
*Company Found Date
 
*Company Stage Level 3 at Round Date
 
*Company Industry Minor Group
 
*Company Industry Sub Group 3
 
*Company State (Code)
 
*Compnay MSA (Code)
 
*Company Street Address, Line 1
 
*Company Street Address, Line 2
 
*Company Zip Cd
 
  
===Exclusions and Classifications===
+
We are generating 6 binary indicators as follows:
  
  Seed/Early/Later (Real VC)
+
  Growth VC (4)
  Seed
+
--------------
  Early Stage
+
Seed (1)
  Later Stage
+
Early Stage (2)
 +
Later Stage (3)
  
  Other VC
+
  Tranactional VC (5)
  Acq. for Expansion
+
--------------
  Acquisition
+
Acq. for Expansion
  Bridge Loan
+
Acquisition
  Expansion
+
Bridge Loan
  Pending Acq
+
Expansion
  Recap or Turnaround
+
Pending Acq
 +
Recap or Turnaround
  
  Exclude (non-VC)
+
  Exclude (non-VC) (6)
  LBO
+
--------------
  MBO
+
LBO
  Open Market Purchase
+
MBO
   PIPE
+
Open Market Purchase
   Secondary Buyout
+
PIPE
 +
Secondary Buyout
 +
VC Partnership
 +
Other
 +
 
 +
==Building companybase and round==
 +
Both are built from roundbase which contains portfolio companies and round details. The scripts are below.
 +
CREATE TABLE companybase AS
 +
SELECT DISTINCT
 +
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,
 +
statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip
 +
FROM roundbase
 +
ORDER BY coname;
 +
 
 +
CREATE TABLE round AS
 +
SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors
 +
FROM roundbase
 +
ORDER BY coname;
 +
 
 +
==Building stageflags==
 +
Stageflags are built from the stage3 column of round. Using the tables above, I made 6 flags for seed, early, later, growth, transaction and exclude. The script is below. Verify that no row entries are missing flags in excel by copying the round table to a txt file. The last 3 columns should always sum to 1 for every row entry. There should be no zero sums and no sums greater than 1.
 +
CREATE TABLE stageflags AS
 +
SELECT coname, stage3,
 +
CASE
 +
  WHEN stage3 = 'Seed' THEN 1
 +
  ELSE 0
 +
  END AS seedflag,
 +
CASE
 +
  WHEN stage3 = 'Early Stage' THEN 1
 +
  ELSE 0
 +
  END AS earlyflag,
 +
CASE
 +
  WHEN stage3 = 'Later Stage' THEN 1
 +
  ELSE 0
 +
  END AS laterflag,
 +
CASE
 +
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1
 +
  ELSE 0
 +
  END AS growthflag,
 +
CASE
 +
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1
 +
  ELSE 0
 +
   END AS transactionflag,
 +
CASE
 +
   WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout'
 +
OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1
 +
  ELSE 0
 +
  END AS excludeflag
 +
FROM round;
 +
 
 +
==Sample output==
 +
The psql file was copied to an output text file using:
 +
\COPY stageflags TO 'flag.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;
 +
A sum column was added in excel which is the sum of the last 3 columns. Notice how all entries are 1. This is good.
 +
 
 +
==Checking companybase table==
 +
The following psql code creates a new table called companybase1 that checks for Undisclosed companies and nonUS companies in the companybase table and also inserts a primary key column which is a concatenation of coname, statecode and datefirstinv.
 +
CREATE TABLE companybase1 AS
 +
SELECT *,
 +
concat(coname::text, statecode::text, datefirstinv::text) AS pkey,
 +
CASE
 +
  WHEN nationcode = 'US' THEN 1::int
 +
  ELSE 0::int
 +
  END AS alwaysusflag,
 +
CASE
 +
  WHEN coname = 'Undisclosed Company' THEN 1::int
 +
  ELSE 0::int
 +
  END AS undisclosedflag
 +
FROM companybase;
 +
\COPY companybase1 TO 'companybase1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;
 +
 
 +
==Checking if pkey is valid ==
 +
The following psql code creates a new table called companybasepkeyvalid that counts the number of times the pkey created in companybase1 occurs in companybase1. In this case there are 3 pkeys that are not valid for companies: Undisclosed Company, New York Digital Health LLC and Undisclosed Company.
 +
CREATE TABLE companybasepkeyvalid AS
 +
SELECT coname, pkey, COUNT(pkey)
 +
FROM companybase1
 +
GROUP BY coname, pkey;
 +
--44771
 +
\COPY companybasepkeyvalid TO 'companybasepkeyvalid.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;
 +
 
 +
The Undisclosed Companies will be updated manually using the description to extract the company name. To remove the duplicate records of the company New York Digital Health LLC from the roundbase table run the following command:
 +
DELETE FROM roundbase WHERE url = 'www.digitalhealthaccelerator.c';
 +
--1
 +
After the Undisclosed company names are updated, you will need to recreate the companybase and roundbase tables.
 +
Copy all the undisclosed companies to a text file and pull out the company names from the description.
 +
DROP TABLE undisclosedcompanies
 +
CREATE TABLE undisclosedcompanies AS
 +
SELECT coname, description, url  FROM roundbase
 +
WHERE coname = 'Undisclosed Company';
 +
--37
 +
\COPY undisclosedcompanies TO 'undisclosedcompanies.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;
 +
 
 +
After manually pulling out the company names from the description from all 37 entries, I created the following commands to update the original roundbase table below. After the updates there are still 18 Undisclosed companies. Recheck to make sure there are no primary key issues again.
 +
UPDATE roundbase SET coname = 'Standard Bariatrics' WHERE coname = 'Undisclosed Company' AND description = 'Standard
 +
Bariatrics is taking bariatric procedure and creating instruments';
 +
--1
 +
UPDATE roundbase SET coname = 'Jersey Watch' WHERE coname = 'Undisclosed Company' AND description = 'Jersey Watch provides
 +
organizers to professional.';
 +
--1
 +
UPDATE roundbase SET coname = 'CareSkore' WHERE coname = 'Undisclosed Company' AND description = 'CareSkore provides an
 +
evidence-based predictive systems.';
 +
--1
 +
UPDATE roundbase SET coname = 'Navistone' WHERE coname = 'Undisclosed Company' AND description = 'Navistone provides digital
 +
marketers a way to target the right web browsers';
 +
--2
 +
UPDATE roundbase SET coname = 'PolySync' WHERE coname = 'Undisclosed Company' AND description = 'Offers PolySync platform
 +
for driverless vehicle applications.';
 +
--2
 +
UPDATE roundbase SET coname = 'Platform9' WHERE coname = 'Undisclosed Company' AND description = 'Platform9 Systems Inc is a
 +
United States-based company.';
 +
--2
 +
UPDATE roundbase SET coname = 'Salamanca' WHERE coname = 'Undisclosed Company' AND description = 'Salamanca provides
 +
products and services.';
 +
--1
 +
UPDATE roundbase SET coname = 'Sourcegraph Inc.' WHERE coname = 'Undisclosed Company' AND description = 'Sourcegraph Inc.
 +
provides a code search engine for open source.';
 +
--1
 +
UPDATE roundbase SET coname = 'Sona' WHERE coname = 'Undisclosed Company' AND description = 'Sona provides products and
 +
services.';
 +
--1
 +
UPDATE roundbase SET coname = 'AlphaSource' WHERE coname = 'Undisclosed Company' AND description = 'AlphaSource in provides
 +
healthcare technology solutions.';
 +
--2
 +
UPDATE roundbase SET coname = 'Reflecktive' WHERE coname = 'Undisclosed Company' AND description = 'Reflecktive provides
 +
products and services.';
 +
--1
 +
UPDATE roundbase SET coname = 'SuperH' WHERE coname = 'Undisclosed Company' AND description = 'SuperH provides an SaaS
 +
platform for open source data';
 +
--1
 +
UPDATE roundbase SET coname = 'Islands Media Inc' WHERE coname = 'Undisclosed Company' AND description = 'Islands Media Inc
 +
is an IT company.';
 +
--1
 +
UPDATE roundbase SET coname = 'Envoy Commerce Corp.' WHERE coname = 'Undisclosed Company' AND description = 'Envoy Commerce
 +
Corp. provides online grocery delivery service.';
 +
--1
 +
UPDATE roundbase SET coname = 'Quilt' WHERE coname = 'Undisclosed Company' AND description = 'Quilt is a consumer-centric
 +
insurance company focused on renters insurances';
 +
--1
 +
 
 +
==Removing duplicates from ipos table==
 +
If you create the following two tables you'll notice that there are duplicate entries in the ipos table.
 +
DROP TABLE ipospkey;
 +
CREATE TABLE ipospkey AS
 +
SELECT *,
 +
concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey
 +
FROM ipos;
 +
--10440
 +
 
 +
DROP TABLE iposcount;
 +
CREATE TABLE iposcount AS
 +
SELECT issuer, issuedate, statecode, COUNT(pkey)
 +
FROM ipospkey
 +
GROUP BY issuer, issuedate, statecode;
 +
--9491
 +
 
 +
To clean these up I built the following tables and selected the max principalamt values if there were duplicates. There is probably a simpler way to do this but I'll leave that for the next person to discover.
 +
DROP TABLE iposcounthigh;
 +
CREATE TABLE iposcounthigh AS
 +
SELECT *
 +
FROM iposcount
 +
WHERE count > 1;
 +
 
 +
DROP TABLE duplicateipodata;
 +
CREATE TABLE duplicateipodata AS
 +
SELECT ipos.*
 +
FROM ipos INNER JOIN iposcounthigh ON (ipos.issuer = iposcounthigh.issuer AND ipos.issuedate = iposcounthigh.issuedate AND
 +
ipos.statecode = iposcounthigh.statecode)
 +
ORDER BY principalamt DESC;
 +
 
 +
DROP TABLE ipoduplicatemin;
 +
CREATE TABLE ipoduplicatemin AS
 +
SELECT issuedate, issuer, statecode, MIN(principalamt) AS principalamt
 +
FROM duplicateipodata
 +
GROUP BY issuedate, issuer, statecode;
 +
--939
 +
\COPY ipoduplicatemin TO 'ipoduplicatemin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
DROP TABLE ipoduplicatekeys;
 +
CREATE TABLE ipoduplicatekeys AS
 +
SELECT *,
 +
concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey
 +
FROM ipoduplicatemin;
 +
 
 +
DROP TABLE iponoduplicates;
 +
CREATE TABLE iponoduplicates AS
 +
SELECT ipospkey.*
 +
FROM ipospkey
 +
WHERE NOT EXISTS(SELECT * FROM ipoduplicatekeys WHERE (ipospkey.pkey = ipoduplicatekeys.pkey));
 +
--9490
 +
 
 +
To verify that there are no more duplicates create another table.
 +
DROP TABLE iposcountdupes1;
 +
CREATE TABLE iposcountdupes1 AS
 +
SELECT issuer, issuedate, statecode, COUNT(pkey)
 +
FROM iponoduplicates
 +
GROUP BY issuer, issuedate, statecode;
 +
\COPY iposcountdupes1 TO 'iposcountdupes1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
You will notice that there are still 9 duplicate records that slipped thru somehow. Remove these manually.
 +
DELETE FROM iponoduplicates WHERE pkey = 'PacTel Corp1993-12-02CA$299.00';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Templeton Dragon Fund Inc1994-09-21FL$169.50';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Sterling Commerce1996-03-08TX$240.00' AND proceedsamt = '288';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Sothebys Holdings Inc1988-05-13NY$27.00';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'TD Waterhouse Group Inc1999-06-23NY$655.20';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Berlitz International Inc1989-12-13NJ$34.65';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Spain Fund Inc1988-06-21NY$39.00';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Ultramar Corp1992-06-26CT$85.39';
 +
--1
 +
DELETE FROM iponoduplicates WHERE pkey = 'Goldman Sachs Group Inc1999-05-03NY$424.00';
 +
--1
 +
 
 +
iponoduplicates is now the master ipo table that contains unique keys which are concatenated from issuer, issuedate and prinicpalamt. It should contain 9481 lines.
 +
 
 +
==Cleaning mas table==
 +
There were duplicate records in this file because the original text file from SDC contained addresses that were on two rows for some companies. Therefore the easiest way to get rid of these records is to remove lines in textpad and reimport into the mas table.
 +
Remove lines in textpad using regex: ^#.*\n and replacing with nothing.
 +
After reimporting there will still be duplicate rows. Slap an id number on them and take the min as shown below.
 +
CREATE TABLE mas1 AS
 +
SELECT *,
 +
concat(targetname::text, targetstate::text, announceddate::text) AS pkey
 +
FROM mas;
 +
ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;
 +
--114890
 +
 
 +
DROP TABLE masminid;
 +
CREATE TABLE masminid AS
 +
SELECT targetname, targetstate, announceddate, MIN(id)
 +
FROM mas1
 +
GROUP BY targetname, targetstate, announceddate;
 +
--114825
 +
 
 +
DROP TABLE masnodupes;
 +
CREATE TABLE masnodupes AS
 +
SELECT mas1.*
 +
FROM mas1 JOIN masminid ON mas1.id = masminid.min;
 +
\COPY masnodupes TO 'masnodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
masnodupes should have no duplicates based on targetname, targetstate, announceddate. After running it thru the matcher you will get about 800 matching errors. We decided to exclude any rows that had equivalent targetname, targetstate where the announced dates fell within the same week. In this case we took the minimum date value and excluded the other row. Below are the queries to make those tables.
 +
 
 +
DROP TABLE masdistinctkeys;
 +
CREATE TABLE masdistinctkeys AS
 +
SELECT DISTINCT targetname, targetstate, announceddate
 +
FROM masnodupes;
 +
--114825
 +
\COPY masdistinctkeys TO 'masdistinctkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
DROP TABLE masdistinctkeysmindates;
 +
CREATE TABLE masdistinctkeysmindates AS
 +
SELECT targetname, targetstate, MIN(announceddate) AS announceddate
 +
FROM masdistinctkeys
 +
GROUP BY targetname, targetstate;
 +
--113236
 +
 
 +
DROP TABLE masdistinctkeysdatewindow;
 +
CREATE TABLE masdistinctkeysdatewindow AS
 +
SELECT masdistinctkeys.*, masdistinctkeysmindates.announceddate as minanndate,
 +
CASE WHEN masdistinctkeys.announceddate - INTERVAL '7 day' > masdistinctkeysmindates.announceddate THEN 1::int
 +
ELSE 0::int
 +
END AS dateflag
 +
FROM masdistinctkeys JOIN masdistinctkeysmindates ON (masdistinctkeys.targetname = masdistinctkeysmindates.targetname AND
 +
masdistinctkeys.targetstate = masdistinctkeysmindates.targetstate);
 +
--114825
 +
\COPY masdistinctkeysdatewindow TO 'masdistinctkeysdatewindow.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
CREATE TABLE masdistinctkeys1 AS
 +
SELECT targetname, targetstate, announceddate
 +
FROM masdistinctkeysdatewindow
 +
WHERE dateflag = 0;
 +
--113267
 +
\COPY masdistinctkeys1 TO 'masdistinctkeys1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
Now when you re-run the matcher on portcokeys and masdistinctkeys1 you will get 428 errors and excluding Undisclosed fields gets you down to 356.
 +
 
 +
In Excel create a column to exclude any entries where the announced date occurs before the firstinvdate. Afterwards, manually flag the other duplicate mismatches picking the earlier announced date.
 +
 
 +
==Import Flagged Matcher Data Into PSQL==
 +
After everything in the matcher output files is manually flagged create two new tables in Excel for iposmatcherouput and masmatcheroutput and copy those files into the tables.
 +
CREATE TABLE masmatcheroutputwithexcludeflag(
 +
  file1coname varchar(255),
 +
  file2targetname varchar(255),
 +
  warningmsg varchar(255),
 +
  file1coname1 varchar(255),
 +
  file1statecode varchar(2),
 +
  file1datefirstinv date, --mm-dd-yyyy
 +
  file2targetname2 varchar(255),
 +
  file2statecode varchar(2),
 +
  file2announceddate date,
 +
  exclude boolean,
 +
  excludemaster boolean,
 +
  excludemanual boolean,
 +
  excludemanualmaster boolean,
 +
  excludefinal boolean
 +
);
 +
\COPY masmatcheroutputwithexcludeflag FROM 'masmatcheroutput-nohead.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
--9400
 +
 
 +
CREATE TABLE ipomatcheroutputwithexcludeflag(
 +
  file1coname varchar(255),
 +
  file2issuer varchar(255),
 +
  warningmsg varchar(255),
 +
  file1coname1 varchar(255),
 +
  file1statecode varchar(2),
 +
  file1datefirstinv date, --mm-dd-yyyy
 +
  file2issuer1 varchar(255),
 +
  file2issuedate date,
 +
  file2statecode varchar(2),
 +
  excludedate boolean,
 +
  excludemaster boolean,
 +
  excludemanual boolean,
 +
  excludemanualmaster boolean,
 +
  excludefinal boolean
 +
);
 +
\COPY ipomatcheroutputwithexcludeflag FROM 'iposmatcheroutput-nohead.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
--2591
 +
 
 +
==Joining portco with ipo and mas==
 +
In order to match the company with the ipo or acquisition date we need to join the three tables together. Do this in a two step process. First join with ipo and then take the resulting table and join with mas. This will give you companies with an ipo column and acquisition column. There will be a lot of blanks because some companies didn't have ipos or acquisitions.
 +
DROP TABLE portcoipojoin;
 +
CREATE TABLE portcoipojoin AS
 +
SELECT coname, statecode, datefirstinv, matcher.file2issuer, matcher.file2issuedate
 +
FROM companybase LEFT JOIN ipomatcheroutputwithexcludeflag AS matcher ON companybase.coname = matcher.file1coname AND
 +
companybase.statecode = matcher.file1statecode AND companybase.datefirstinv = matcher.file1datefirstinv AND
 +
matcher.excludefinal = FALSE;
 +
--44773
 +
\COPY portcoipojoin TO 'portcoipojoin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
Now join with the iposnoduplicates table to grab proceedamt and principalamt, etc.
 +
DROP TABLE portcoipojoinamt;
 +
CREATE TABLE portcoipojoinamt AS
 +
SELECT portcoipojoin.*, ipo.principalamt, ipo.proceedsamt, ipo.naiccode, ipo.zipcode, ipo.status, ipo.foundeddate, ipo.pkey
 +
FROM portcoipojoin LEFT JOIN iponoduplicates AS ipo ON portcoipojoin.file2issuer = ipo.issuer AND
 +
portcoipojoin.file2issuedate = ipo.issuedate AND portcoipojoin.statecode = ipo.statecode;
 +
 
 +
DROP TABLE portcoipojoinmajoin;
 +
CREATE TABLE portcoipojoinmajoin AS
 +
SELECT portco.*, matcher.file2targetname, matcher.file2announceddate
 +
FROM portcoipojoin AS portco LEFT JOIN masmatcheroutputwithexcludeflag AS matcher ON portco.coname = matcher.file1coname
 +
AND portco.statecode = matcher.file1statecode AND portco.datefirstinv = matcher.file1datefirstinv AND matcher.excludefinal
 +
= FALSE;
 +
--44774
 +
\COPY portcoipojoinmajoin TO 'portcoipojoinmajoin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
For mas you will need to join to the masnodupes table because you will want acquiror name and transaction amount. The matcher file only contains the key without the extra information.
 +
DROP TABLE portcomajoinmasnodupes;
 +
CREATE TABLE portcomajoinmasnodupes AS
 +
SELECT portcomajoin.*, masnodupes.* 
 +
FROM portcomajoin LEFT JOIN masnodupes ON masnodupes.targetname = portcomajoin.file2targetname AND
 +
masnodupes.targetstatecode = portcomajoin.statecode AND masnodupes.announceddate = portcomajoin.file2announceddate;
 +
--44755
 +
 
 +
Then join with the portco table again and extract out the city and address.
 +
DROP TABLE portcoipomalocation;
 +
CREATE TABLE portcoipomalocation AS
 +
SELECT portcojoin.*, city, addr1, addr2, indclass, indsubgroup3, indminor
 +
FROM portcoipojoinmajoin AS portcojoin LEFT JOIN companybase ON portcojoin.coname = companybase.coname AND
 +
portcojoin.statecode = companybase.statecode;
 +
--44878
 +
\COPY portcoipomalocation TO 'portcoipomalocation.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
--now join the two ma and ipo tables together on coname, statecode, datefirstinv
 +
DROP TABLE portcoipoma;
 +
CREATE TABLE portcoipoma AS
 +
SELECT ma.*, ipo.file2issuer, ipo.file2issuedate, ipo.principalamt, ipo.proceedsamt, ipo.naiccode, ipo.zipcode, ipo.status, ipo.foundeddate 
 +
FROM portcomajoinmasnodupes AS ma JOIN portcoipojoinamt AS ipo ON ma.coname = ipo.coname AND ma.statecode = ipo.statecode AND ma.datefirstinv = ipo.datefirstinv;
 +
--44755
 +
\COPY portcoipoma TO 'portcoipoma.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
DROP TABLE portcoipomanodupes;
 +
CREATE TABLE portcoipomanodupes AS
 +
SELECT *
 +
FROM portcoipoma WHERE transactionamt IS NULL OR proceedsamt IS NULL;
 +
--44706
 +
\COPY portcoipomanodupes TO 'portcoipomanodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
 
 +
==Joining geo with portcoipomas==
 +
Check the portcoipomas table first. Undisclosed Companies may have slipped through so you might have to rerun the joins against the companybase1 table selecting the appropriate flags. You will match portcoipomas with the geo data using coname, city, year so you will need to strip the year from the datefirstinv. Use the EXTRACT(YEAR FROM datefirstinv) command. I broke these into separate tables:
 +
CREATE TABLE portcoipomalocationexclude AS
 +
SELECT *, EXTRACT(YEAR FROM datefirstinv) AS datefirstyear
 +
FROM portcoipomalocation AS portco
 +
WHERE NOT EXISTS(SELECT * FROM excludeipomadupes as exclude WHERE (portco.coname = exclude.coname AND portco.statecode =
 +
exclude.statecode AND portco.datefirstinv = exclude.datefirstinv AND exclude.excludeflag = 1));
 +
 
 +
CREATE TABLE companybasegeo AS
 +
SELECT portco.*, geo.lattitude, geo.longitude, geo.noaddress
 +
FROM portcoipomalocationexclude as portco LEFT JOIN geo ON (portco.coname = geo.coname AND portco.city = geo.city AND
 +
portco.datefirstyear = geo.startyear);
 +
 
 +
You will notice that the output has blank entries for the noaddress flag. This indicates that the tables aren't joining correctly. Indeed you will see that the company Capella Systems, Inc. in the geo table is named Capella Systems Inc. in the portcoipomalocationexclude table. Therefore you will need to use the Matcher tool to match the keys in the two tables.
 +
 
 +
CREATE TABLE geomatcher (
 +
geoconame varchar(255),
 +
geocity varchar(100),
 +
geostartyear varchar(4),
 +
file2coname varchar(255),
 +
file2city varchar(100),
 +
file2datefirstyear varchar(4)
 +
);
 +
\COPY geomatcher FROM 'matcheroutputgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
--43770
 +
 
 +
Geolookupbatch1 contains the first 2299 records. Geolookupbatch2 contains 6597 records. Geolookupbatch3 contains 2527 records. This sums to 11423 which matches the number of initial lookups?

Latest revision as of 13:44, 21 September 2020


Project
Venture Capital (Data)
Project logo 02.png
Project Information
Has title Venture Capital (Data)
Has owner Adrian Smart, Jake Silberman, Meghana Gaur
Has start date
Has deadline date
Has keywords Data
Has project status Subsume
Has sponsor McNair Center
Has project output Data, Tool
Copyright © 2019 edegan.com. All Rights Reserved.

NOTICE

This project page is largely redundant and needs to be cleaned up. The main project pages for this project are: Retrieving US VC Data From SDC, and VC Database Rebuild.


Project Objective

To create a master venture capital data set on which all other projects requiring venture capital data will be built.

Specifically, we will be building data to support the following:

  • Lists and counts by year as well as state, MSA, and city, zip code, etc.
  • Identify seed, early and later stage deals (first investment), dollars invested and rounds
  • Identify transactional VC dollars invested and rounds
  • Acquisitions and IPOs of VC backed firms
  • Identify when VC backed firms are "alive"
  • Geocode all portfolio companies
  • Geocode VC headquarters and branch offices
  • Compute distances between investors and their portfolio companies
  • Identify lead investors for portfolio companies
  • Identify top 100 VCs

Retrieving data

Old notes are Old Venture Capital Data Work

Retrieving US VC Data From SDC requires SDC platinum, perl scripts and SQL scripts, as well as some manual processing.

Scripts and other info

SDC and Perl Scripts are in:

E:\McNair\Projects\VC Database

This includes:

  • NormalizeFixedWidth.pl
  • RoundOnOneLine.pl - Needed for round on one line

SQL Scripts and finished data are in:

Z:\VentureCapitalData\SDCVCData

Notes:

  • Portfolio companies attributes to be extracted from roundlevel information
  • Round-on-one-line processed using RoundOnOneLine.pl and NormalizeFixedWidth.pl
  • Firms includes branch office so attributes must be extracted
  • Portfolio company descriptions - just the portco name, state, date of first inv, and the long description - has to be custom processed.

The next steps are detailed in VC Database Rebuild.


Additional datafiles (in E:\McNair\Projects\VC Database):

  • GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.

Processing the base tables

The SQL scripts need to do the following:

  • Extract company infomation from roundbase then process it
    • Add variable for growth (1), transactional (2), or non-VC (3) (see below)
    • Create clean 5-digit ZIP field
  • Extract firm information from firmbase, then:
    • Add indicator variable for US
    • Create clean 5-digit ZIP field
  • Build out BranchOffice (BO) table
  • Build lookups for:
    • State Code
    • raisestatus

SQL file to process data is:

Z:/VentureCapitalData/SDCVC/ProcessData.sql

Lookup Tables

Stage of Investment

We are generating 6 binary indicators as follows:

Growth VC (4)
--------------
Seed (1)
Early Stage (2)
Later Stage (3)
Tranactional VC (5)
--------------
Acq. for Expansion
Acquisition
Bridge Loan
Expansion
Pending Acq
Recap or Turnaround
Exclude (non-VC) (6)
--------------
LBO
MBO
Open Market Purchase
PIPE
Secondary Buyout
VC Partnership
Other

Building companybase and round

Both are built from roundbase which contains portfolio companies and round details. The scripts are below.

CREATE TABLE companybase AS
SELECT DISTINCT 
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,
statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip 
FROM roundbase
ORDER BY coname;
CREATE TABLE round AS
SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors
FROM roundbase
ORDER BY coname;

Building stageflags

Stageflags are built from the stage3 column of round. Using the tables above, I made 6 flags for seed, early, later, growth, transaction and exclude. The script is below. Verify that no row entries are missing flags in excel by copying the round table to a txt file. The last 3 columns should always sum to 1 for every row entry. There should be no zero sums and no sums greater than 1.

CREATE TABLE stageflags AS
SELECT coname, stage3,
CASE
 WHEN stage3 = 'Seed' THEN 1
 ELSE 0
 END AS seedflag,
CASE
 WHEN stage3 = 'Early Stage' THEN 1
 ELSE 0
 END AS earlyflag,
CASE
 WHEN stage3 = 'Later Stage' THEN 1
 ELSE 0
 END AS laterflag,
CASE
 WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1
 ELSE 0
 END AS growthflag,
CASE
 WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1
 ELSE 0
 END AS transactionflag,
CASE
 WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' 
OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1
 ELSE 0
 END AS excludeflag
FROM round;

Sample output

The psql file was copied to an output text file using:

\COPY stageflags TO 'flag.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV; 

A sum column was added in excel which is the sum of the last 3 columns. Notice how all entries are 1. This is good.

Checking companybase table

The following psql code creates a new table called companybase1 that checks for Undisclosed companies and nonUS companies in the companybase table and also inserts a primary key column which is a concatenation of coname, statecode and datefirstinv.

CREATE TABLE companybase1 AS 
SELECT *,
concat(coname::text, statecode::text, datefirstinv::text) AS pkey,
CASE
 WHEN nationcode = 'US' THEN 1::int
 ELSE 0::int
 END AS alwaysusflag,
CASE
 WHEN coname = 'Undisclosed Company' THEN 1::int
 ELSE 0::int
 END AS undisclosedflag
FROM companybase;
\COPY companybase1 TO 'companybase1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

Checking if pkey is valid

The following psql code creates a new table called companybasepkeyvalid that counts the number of times the pkey created in companybase1 occurs in companybase1. In this case there are 3 pkeys that are not valid for companies: Undisclosed Company, New York Digital Health LLC and Undisclosed Company.

CREATE TABLE companybasepkeyvalid AS
SELECT coname, pkey, COUNT(pkey)
FROM companybase1
GROUP BY coname, pkey;
--44771
\COPY companybasepkeyvalid TO 'companybasepkeyvalid.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

The Undisclosed Companies will be updated manually using the description to extract the company name. To remove the duplicate records of the company New York Digital Health LLC from the roundbase table run the following command:

DELETE FROM roundbase WHERE url = 'www.digitalhealthaccelerator.c';
--1

After the Undisclosed company names are updated, you will need to recreate the companybase and roundbase tables. Copy all the undisclosed companies to a text file and pull out the company names from the description.

DROP TABLE undisclosedcompanies
CREATE TABLE undisclosedcompanies AS
SELECT coname, description, url  FROM roundbase
WHERE coname = 'Undisclosed Company';
--37
\COPY undisclosedcompanies TO 'undisclosedcompanies.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

After manually pulling out the company names from the description from all 37 entries, I created the following commands to update the original roundbase table below. After the updates there are still 18 Undisclosed companies. Recheck to make sure there are no primary key issues again.

UPDATE roundbase SET coname = 'Standard Bariatrics' WHERE coname = 'Undisclosed Company' AND description = 'Standard 
Bariatrics is taking bariatric procedure and creating instruments';
--1
UPDATE roundbase SET coname = 'Jersey Watch' WHERE coname = 'Undisclosed Company' AND description = 'Jersey Watch provides 
organizers to professional.';
--1
UPDATE roundbase SET coname = 'CareSkore' WHERE coname = 'Undisclosed Company' AND description = 'CareSkore provides an 
evidence-based predictive systems.';
--1
UPDATE roundbase SET coname = 'Navistone' WHERE coname = 'Undisclosed Company' AND description = 'Navistone provides digital 
marketers a way to target the right web browsers';
--2
UPDATE roundbase SET coname = 'PolySync' WHERE coname = 'Undisclosed Company' AND description = 'Offers PolySync platform 
for driverless vehicle applications.';
--2
UPDATE roundbase SET coname = 'Platform9' WHERE coname = 'Undisclosed Company' AND description = 'Platform9 Systems Inc is a 
United States-based company.';
--2
UPDATE roundbase SET coname = 'Salamanca' WHERE coname = 'Undisclosed Company' AND description = 'Salamanca provides 
products and services.';
--1
UPDATE roundbase SET coname = 'Sourcegraph Inc.' WHERE coname = 'Undisclosed Company' AND description = 'Sourcegraph Inc. 
provides a code search engine for open source.';
--1
UPDATE roundbase SET coname = 'Sona' WHERE coname = 'Undisclosed Company' AND description = 'Sona provides products and 
services.';
--1
UPDATE roundbase SET coname = 'AlphaSource' WHERE coname = 'Undisclosed Company' AND description = 'AlphaSource in provides 
healthcare technology solutions.';
--2
UPDATE roundbase SET coname = 'Reflecktive' WHERE coname = 'Undisclosed Company' AND description = 'Reflecktive provides 
products and services.';
--1
UPDATE roundbase SET coname = 'SuperH' WHERE coname = 'Undisclosed Company' AND description = 'SuperH provides an SaaS 
platform for open source data';
--1
UPDATE roundbase SET coname = 'Islands Media Inc' WHERE coname = 'Undisclosed Company' AND description = 'Islands Media Inc 
is an IT company.';
--1
UPDATE roundbase SET coname = 'Envoy Commerce Corp.' WHERE coname = 'Undisclosed Company' AND description = 'Envoy Commerce 
Corp. provides online grocery delivery service.';
--1
UPDATE roundbase SET coname = 'Quilt' WHERE coname = 'Undisclosed Company' AND description = 'Quilt is a consumer-centric 
insurance company focused on renters insurances';
--1

Removing duplicates from ipos table

If you create the following two tables you'll notice that there are duplicate entries in the ipos table.

DROP TABLE ipospkey;
CREATE TABLE ipospkey AS
SELECT *,
concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey
FROM ipos;
--10440
DROP TABLE iposcount;
CREATE TABLE iposcount AS
SELECT issuer, issuedate, statecode, COUNT(pkey)
FROM ipospkey
GROUP BY issuer, issuedate, statecode;
--9491

To clean these up I built the following tables and selected the max principalamt values if there were duplicates. There is probably a simpler way to do this but I'll leave that for the next person to discover.

DROP TABLE iposcounthigh;
CREATE TABLE iposcounthigh AS
SELECT *
FROM iposcount
WHERE count > 1; 
DROP TABLE duplicateipodata;
CREATE TABLE duplicateipodata AS
SELECT ipos.*
FROM ipos INNER JOIN iposcounthigh ON (ipos.issuer = iposcounthigh.issuer AND ipos.issuedate = iposcounthigh.issuedate AND 
ipos.statecode = iposcounthigh.statecode)
ORDER BY principalamt DESC;
DROP TABLE ipoduplicatemin;
CREATE TABLE ipoduplicatemin AS
SELECT issuedate, issuer, statecode, MIN(principalamt) AS principalamt 
FROM duplicateipodata
GROUP BY issuedate, issuer, statecode;
--939
\COPY ipoduplicatemin TO 'ipoduplicatemin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
DROP TABLE ipoduplicatekeys;
CREATE TABLE ipoduplicatekeys AS
SELECT *,
concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey
FROM ipoduplicatemin;
DROP TABLE iponoduplicates;
CREATE TABLE iponoduplicates AS
SELECT ipospkey.*
FROM ipospkey
WHERE NOT EXISTS(SELECT * FROM ipoduplicatekeys WHERE (ipospkey.pkey = ipoduplicatekeys.pkey)); 
--9490

To verify that there are no more duplicates create another table.

DROP TABLE iposcountdupes1;
CREATE TABLE iposcountdupes1 AS
SELECT issuer, issuedate, statecode, COUNT(pkey)
FROM iponoduplicates
GROUP BY issuer, issuedate, statecode;
\COPY iposcountdupes1 TO 'iposcountdupes1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

You will notice that there are still 9 duplicate records that slipped thru somehow. Remove these manually.

DELETE FROM iponoduplicates WHERE pkey = 'PacTel Corp1993-12-02CA$299.00';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Templeton Dragon Fund Inc1994-09-21FL$169.50';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Sterling Commerce1996-03-08TX$240.00' AND proceedsamt = '288';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Sothebys Holdings Inc1988-05-13NY$27.00';
--1
DELETE FROM iponoduplicates WHERE pkey = 'TD Waterhouse Group Inc1999-06-23NY$655.20';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Berlitz International Inc1989-12-13NJ$34.65';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Spain Fund Inc1988-06-21NY$39.00';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Ultramar Corp1992-06-26CT$85.39';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Goldman Sachs Group Inc1999-05-03NY$424.00';
--1

iponoduplicates is now the master ipo table that contains unique keys which are concatenated from issuer, issuedate and prinicpalamt. It should contain 9481 lines.

Cleaning mas table

There were duplicate records in this file because the original text file from SDC contained addresses that were on two rows for some companies. Therefore the easiest way to get rid of these records is to remove lines in textpad and reimport into the mas table.

Remove lines in textpad using regex: ^#.*\n and replacing with nothing.

After reimporting there will still be duplicate rows. Slap an id number on them and take the min as shown below.

CREATE TABLE mas1 AS
SELECT *,
concat(targetname::text, targetstate::text, announceddate::text) AS pkey
FROM mas;
ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;
--114890
DROP TABLE masminid;
CREATE TABLE masminid AS
SELECT targetname, targetstate, announceddate, MIN(id)
FROM mas1
GROUP BY targetname, targetstate, announceddate;
--114825
DROP TABLE masnodupes;
CREATE TABLE masnodupes AS
SELECT mas1.*
FROM mas1 JOIN masminid ON mas1.id = masminid.min;
\COPY masnodupes TO 'masnodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

masnodupes should have no duplicates based on targetname, targetstate, announceddate. After running it thru the matcher you will get about 800 matching errors. We decided to exclude any rows that had equivalent targetname, targetstate where the announced dates fell within the same week. In this case we took the minimum date value and excluded the other row. Below are the queries to make those tables.

DROP TABLE masdistinctkeys;
CREATE TABLE masdistinctkeys AS
SELECT DISTINCT targetname, targetstate, announceddate
FROM masnodupes;
--114825
\COPY masdistinctkeys TO 'masdistinctkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
DROP TABLE masdistinctkeysmindates;
CREATE TABLE masdistinctkeysmindates AS
SELECT targetname, targetstate, MIN(announceddate) AS announceddate
FROM masdistinctkeys
GROUP BY targetname, targetstate;
--113236
DROP TABLE masdistinctkeysdatewindow;
CREATE TABLE masdistinctkeysdatewindow AS
SELECT masdistinctkeys.*, masdistinctkeysmindates.announceddate as minanndate,
CASE WHEN masdistinctkeys.announceddate - INTERVAL '7 day' > masdistinctkeysmindates.announceddate THEN 1::int
ELSE 0::int
END AS dateflag
FROM masdistinctkeys JOIN masdistinctkeysmindates ON (masdistinctkeys.targetname = masdistinctkeysmindates.targetname AND 
masdistinctkeys.targetstate = masdistinctkeysmindates.targetstate);
--114825
\COPY masdistinctkeysdatewindow TO 'masdistinctkeysdatewindow.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV 
CREATE TABLE masdistinctkeys1 AS
SELECT targetname, targetstate, announceddate
FROM masdistinctkeysdatewindow
WHERE dateflag = 0;
--113267
\COPY masdistinctkeys1 TO 'masdistinctkeys1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

Now when you re-run the matcher on portcokeys and masdistinctkeys1 you will get 428 errors and excluding Undisclosed fields gets you down to 356.

In Excel create a column to exclude any entries where the announced date occurs before the firstinvdate. Afterwards, manually flag the other duplicate mismatches picking the earlier announced date.

Import Flagged Matcher Data Into PSQL

After everything in the matcher output files is manually flagged create two new tables in Excel for iposmatcherouput and masmatcheroutput and copy those files into the tables.

CREATE TABLE masmatcheroutputwithexcludeflag(
 file1coname varchar(255),
 file2targetname varchar(255),
 warningmsg varchar(255),
 file1coname1 varchar(255),
 file1statecode varchar(2),
 file1datefirstinv date, --mm-dd-yyyy
 file2targetname2 varchar(255),
 file2statecode varchar(2),
 file2announceddate date,
 exclude boolean,
 excludemaster boolean,
 excludemanual boolean,
 excludemanualmaster boolean,
 excludefinal boolean
);
\COPY masmatcheroutputwithexcludeflag FROM 'masmatcheroutput-nohead.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--9400
CREATE TABLE ipomatcheroutputwithexcludeflag(
 file1coname varchar(255),
 file2issuer varchar(255),
 warningmsg varchar(255),
 file1coname1 varchar(255),
 file1statecode varchar(2),
 file1datefirstinv date, --mm-dd-yyyy
 file2issuer1 varchar(255),
 file2issuedate date,
 file2statecode varchar(2),
 excludedate boolean,
 excludemaster boolean,
 excludemanual boolean,
 excludemanualmaster boolean,
 excludefinal boolean
);
\COPY ipomatcheroutputwithexcludeflag FROM 'iposmatcheroutput-nohead.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--2591

Joining portco with ipo and mas

In order to match the company with the ipo or acquisition date we need to join the three tables together. Do this in a two step process. First join with ipo and then take the resulting table and join with mas. This will give you companies with an ipo column and acquisition column. There will be a lot of blanks because some companies didn't have ipos or acquisitions.

DROP TABLE portcoipojoin;
CREATE TABLE portcoipojoin AS
SELECT coname, statecode, datefirstinv, matcher.file2issuer, matcher.file2issuedate
FROM companybase LEFT JOIN ipomatcheroutputwithexcludeflag AS matcher ON companybase.coname = matcher.file1coname AND 
companybase.statecode = matcher.file1statecode AND companybase.datefirstinv = matcher.file1datefirstinv AND 
matcher.excludefinal = FALSE; 
--44773
\COPY portcoipojoin TO 'portcoipojoin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

Now join with the iposnoduplicates table to grab proceedamt and principalamt, etc.

DROP TABLE portcoipojoinamt;
CREATE TABLE portcoipojoinamt AS
SELECT portcoipojoin.*, ipo.principalamt, ipo.proceedsamt, ipo.naiccode, ipo.zipcode, ipo.status, ipo.foundeddate, ipo.pkey
FROM portcoipojoin LEFT JOIN iponoduplicates AS ipo ON portcoipojoin.file2issuer = ipo.issuer AND 
portcoipojoin.file2issuedate = ipo.issuedate AND portcoipojoin.statecode = ipo.statecode;
DROP TABLE portcoipojoinmajoin;
CREATE TABLE portcoipojoinmajoin AS
SELECT portco.*, matcher.file2targetname, matcher.file2announceddate
FROM portcoipojoin AS portco LEFT JOIN masmatcheroutputwithexcludeflag AS matcher ON portco.coname = matcher.file1coname 
AND portco.statecode = matcher.file1statecode AND portco.datefirstinv = matcher.file1datefirstinv AND matcher.excludefinal 
= FALSE;
--44774
\COPY portcoipojoinmajoin TO 'portcoipojoinmajoin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

For mas you will need to join to the masnodupes table because you will want acquiror name and transaction amount. The matcher file only contains the key without the extra information.

DROP TABLE portcomajoinmasnodupes;
CREATE TABLE portcomajoinmasnodupes AS
SELECT portcomajoin.*, masnodupes.*  
FROM portcomajoin LEFT JOIN masnodupes ON masnodupes.targetname = portcomajoin.file2targetname AND 
masnodupes.targetstatecode = portcomajoin.statecode AND masnodupes.announceddate = portcomajoin.file2announceddate;
--44755

Then join with the portco table again and extract out the city and address.

DROP TABLE portcoipomalocation;
CREATE TABLE portcoipomalocation AS
SELECT portcojoin.*, city, addr1, addr2, indclass, indsubgroup3, indminor
FROM portcoipojoinmajoin AS portcojoin LEFT JOIN companybase ON portcojoin.coname = companybase.coname AND 
portcojoin.statecode = companybase.statecode;
--44878
\COPY portcoipomalocation TO 'portcoipomalocation.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV

--now join the two ma and ipo tables together on coname, statecode, datefirstinv DROP TABLE portcoipoma; CREATE TABLE portcoipoma AS SELECT ma.*, ipo.file2issuer, ipo.file2issuedate, ipo.principalamt, ipo.proceedsamt, ipo.naiccode, ipo.zipcode, ipo.status, ipo.foundeddate FROM portcomajoinmasnodupes AS ma JOIN portcoipojoinamt AS ipo ON ma.coname = ipo.coname AND ma.statecode = ipo.statecode AND ma.datefirstinv = ipo.datefirstinv; --44755 \COPY portcoipoma TO 'portcoipoma.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV

DROP TABLE portcoipomanodupes; CREATE TABLE portcoipomanodupes AS SELECT * FROM portcoipoma WHERE transactionamt IS NULL OR proceedsamt IS NULL; --44706 \COPY portcoipomanodupes TO 'portcoipomanodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV

Joining geo with portcoipomas

Check the portcoipomas table first. Undisclosed Companies may have slipped through so you might have to rerun the joins against the companybase1 table selecting the appropriate flags. You will match portcoipomas with the geo data using coname, city, year so you will need to strip the year from the datefirstinv. Use the EXTRACT(YEAR FROM datefirstinv) command. I broke these into separate tables:

CREATE TABLE portcoipomalocationexclude AS
SELECT *, EXTRACT(YEAR FROM datefirstinv) AS datefirstyear
FROM portcoipomalocation AS portco
WHERE NOT EXISTS(SELECT * FROM excludeipomadupes as exclude WHERE (portco.coname = exclude.coname AND portco.statecode = 
exclude.statecode AND portco.datefirstinv = exclude.datefirstinv AND exclude.excludeflag = 1));
CREATE TABLE companybasegeo AS
SELECT portco.*, geo.lattitude, geo.longitude, geo.noaddress
FROM portcoipomalocationexclude as portco LEFT JOIN geo ON (portco.coname = geo.coname AND portco.city = geo.city AND 
portco.datefirstyear = geo.startyear);

You will notice that the output has blank entries for the noaddress flag. This indicates that the tables aren't joining correctly. Indeed you will see that the company Capella Systems, Inc. in the geo table is named Capella Systems Inc. in the portcoipomalocationexclude table. Therefore you will need to use the Matcher tool to match the keys in the two tables.

CREATE TABLE geomatcher (
geoconame varchar(255),
geocity varchar(100),
geostartyear varchar(4),
file2coname varchar(255),
file2city varchar(100),
file2datefirstyear varchar(4)
);
\COPY geomatcher FROM 'matcheroutputgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--43770

Geolookupbatch1 contains the first 2299 records. Geolookupbatch2 contains 6597 records. Geolookupbatch3 contains 2527 records. This sums to 11423 which matches the number of initial lookups?