<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=AdrianS</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=AdrianS"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/AdrianS"/>
	<updated>2026-05-12T22:55:37Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19788</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19788"/>
		<updated>2017-08-04T22:03:25Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Creating portcokey, roundline, firm table */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
The file containing all the SQL queries used to build vcdb2 is located in the Z drive and named ProcessData2.sql. &lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in the same folder. Check that the number of rows copied into each new table matches the line count recorded in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains data we want to remove, such as undisclosed companies and duplicate entries. To find out what needs cleaning, build your companybase table first. The companybase table is clean once there is a 1:1 relationship between keys and entries. Then apply the fixes to the roundbase table itself: any cleaning done downstream should be incorporated upstream into the base table, otherwise dirty keys will infect every other table you build from roundbase. Once the roundbase table is clean, rename it roundbasecore so that it is clearly marked as safe for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
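As an alternative to copying companybase1 out to a text editor, the full rows behind each duplicated key can be listed inside psql. This is a minimal sketch, assuming the companybase1 table and flags built above. Keys that contain NULL values will not join back with =, so treat it as a starting point rather than a complete check:&lt;br /&gt;
 --list every column of the rows whose key appears more than once&lt;br /&gt;
 SELECT c.*&lt;br /&gt;
 FROM companybase1 AS c INNER JOIN&lt;br /&gt;
 (SELECT coname, statecode, datefirstinv FROM companybase1&lt;br /&gt;
 WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON c.coname = d.coname AND c.statecode = d.statecode AND c.datefirstinv = d.datefirstinv&lt;br /&gt;
 ORDER BY c.coname;&lt;br /&gt;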
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
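The two counts can also be returned side by side by a single statement. This is a minimal sketch of the same check (the column aliases totalrows and distinctkeys are made up for illustration). The table is 1:1 on its key when both numbers are equal:&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasecore) AS totalrows,&lt;br /&gt;
 (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a) AS distinctkeys;&lt;br /&gt;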
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
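For comparison, PostgreSQL's DISTINCT ON can keep exactly one row per key in a single step. This is a sketch of a different technique, not the method used above: it keeps the row with the largest principalamt for every key and, where rows tie, keeps an arbitrary one of them rather than removing the key, so its row count can differ from the ipocore table built above. The table name ipocorealt is hypothetical:&lt;br /&gt;
 --one row per (issuer, issuedate, statecode), largest principalamt first&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;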
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 --join on mas1, the table whose ids masinclude was built from&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the mascore table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher on them will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
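To see how the flag behaves, the same CASE expression can be run on a few made-up dates. This is a small sketch, assuming a minimum announceddate of 2010-01-04 (all dates here are hypothetical). It should flag 2010-01-04 and 2010-01-20 with 1, and 2010-01-08 and 2010-01-11 with 0:&lt;br /&gt;
 SELECT d.announceddate, m.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; m.minanndate OR d.announceddate = m.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-04'), (DATE '2010-01-08'), (DATE '2010-01-11'), (DATE '2010-01-20')) AS d(announceddate)&lt;br /&gt;
 CROSS JOIN (SELECT DATE '2010-01-04' AS minanndate) AS m;&lt;br /&gt;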
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the merger/acquisition announceddate is later than the datefirstinv (announceddate &amp;gt; datefirstinv), which also rules out matches where the two dates are equal.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join row count should equal the row count of the matcherwarningmindates table (364), but it doesn't (366). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
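In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so a bare LEAST(announceddate, ipoissuedate) should give the same masterdate as the CASE expression above. A quick sketch with made-up dates to confirm the behaviour on your server:&lt;br /&gt;
 --the first three columns return a date, the last one returns NULL&lt;br /&gt;
 SELECT LEAST(DATE '2012-05-01', NULL) AS onlymas,&lt;br /&gt;
 LEAST(NULL, DATE '2011-03-15') AS onlyipo,&lt;br /&gt;
 LEAST(DATE '2012-05-01', DATE '2011-03-15') AS bothdates,&lt;br /&gt;
 LEAST(NULL::date, NULL::date) AS neither;&lt;br /&gt;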
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
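To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the sum to the total number of keys in the mas and ipo matcher core tables (the ipo one, matcherportcoipocore, has 2350 rows). Each company that matched both an ipo and an ma keeps only its earlier exit, so one key is dropped for each of them. In my case I get 8610 + 2312 + 83 = 11005 = 2350 + 8655.  &lt;br /&gt;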
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
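The three counts used in this check can also be gathered in one statement. This is only a convenience sketch, and the column aliases are illustrative.&lt;br /&gt;
 --sketch only: aliases validmaskeys, validipokeys and bothexits are made up for this example&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore) AS validmaskeys,&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeyipokeycore) AS validipokeys,&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
   WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS bothexits;&lt;br /&gt;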
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage, make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], only the exit deal with the minimum date is kept for each company, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
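One further check that may be worth running, sketched here under the assumption that (coname, statecode, datefirstinv) should remain unique in the master table: list any key that appears more than once. If the joins behaved, this should return no rows.&lt;br /&gt;
 --sketch only: lists duplicated company keys in the master table&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*) AS numrows&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;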
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
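If the counts above do not match, it helps to see which keys are repeated. A minimal sketch for ipocore, using the same GROUP BY pattern as elsewhere on this page (the same query works for companybasecore with its own key columns):&lt;br /&gt;
 --sketch only: lists ipo keys that occur more than once&lt;br /&gt;
 SELECT issuer, issuedate, statecode, COUNT(*) AS numrows&lt;br /&gt;
 FROM ipocore&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY numrows DESC;&lt;br /&gt;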
Next export the keys to text files and put them in the Matcher input folder, then run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, you will want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, you will want to keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
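Since the flags are added by hand in Excel, it can be worth verifying them after import. The sketch below counts rows whose flags disagree with the rules described above. It assumes that either exclude flag being 1 means issuedate &amp;lt;= datefirstinv, and that the warning text matches the string exactly; both are assumptions about how the spreadsheet was filled in. A result of 0 means the flags are consistent.&lt;br /&gt;
 --sketch only: assumes the flag meanings described in the text, not verified against the spreadsheet&lt;br /&gt;
 SELECT COUNT(*) AS flagmismatches&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE ((excludeflag1 = 1 OR excludeflag2 = 1) IS DISTINCT FROM (file2issuedate &amp;lt;= file1datefirstinv))&lt;br /&gt;
  OR ((warningflag = 1) IS DISTINCT FROM (COALESCE(warning, '') = 'Hall-Warning:Multiple'));&lt;br /&gt;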
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
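An alternative sketch for the same step uses PostgreSQL's DISTINCT ON, which keeps exactly one row per portco key, namely the one with the earliest issuedate. Unlike the MIN-and-join approach above, it returns a single row even when two matches share the same minimum date, so the row counts may differ slightly. It is shown as a plain SELECT rather than as a replacement for the tables above.&lt;br /&gt;
 --sketch only: one row per portco key, earliest issuedate first&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;&lt;br /&gt;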
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
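The check above covers the portco side only. If each IPO is also expected to match at most one portfolio company (an assumption, not something stated above), the symmetric query below would list IPO keys that were matched more than once.&lt;br /&gt;
 --sketch only: assumes an ipo key should map to a single portco key&lt;br /&gt;
 SELECT file2issuer, file2statecode, file2issuedate, COUNT(*) AS numrows&lt;br /&gt;
 FROM matcherportcoipocore&lt;br /&gt;
 GROUP BY file2issuer, file2statecode, file2issuedate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;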
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
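To see the full rows behind the remaining duplicate keys, one option is to join geo1 back to its own duplicate-key list. This is a sketch that relies only on the key columns used above.&lt;br /&gt;
 --sketch only: shows every column for rows whose key is duplicated&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear&lt;br /&gt;
       FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
  ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;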
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table, as shown below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
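A likely cause of the extra rows is that some portco keys appear more than once in the matcher output, or that (coname, city, year) is not unique in companybasecore. A sketch for checking the first possibility:&lt;br /&gt;
 --sketch only: portco keys that the matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;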
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
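To see how many companies ended up without coordinates after the LEFT JOIN, a quick coverage count can be run. COUNT(latitude) counts only non-NULL values, so the difference is the number of companies with no latitude.&lt;br /&gt;
 --sketch only: aliases are illustrative&lt;br /&gt;
 SELECT COUNT(*) AS total, COUNT(latitude) AS withcoords, COUNT(*) - COUNT(latitude) AS missingcoords&lt;br /&gt;
 FROM companybasegeomaster;&lt;br /&gt;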
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are at [[Geocode.py]]. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect, which I found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future, the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
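The CASE expression in deaddate1 takes the first available value among issuedate, announceddate and datelastinv + 5 years, which is what COALESCE does. As a hedged cross-check, the query below should return 0 if the two forms agree on every row.&lt;br /&gt;
 --sketch only: counts rows where the CASE result differs from the COALESCE form&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 WHERE deaddate IS DISTINCT FROM COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR');&lt;br /&gt;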
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
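The tempbase query relies on an allyears table with a single year column, which is not defined in this section. If it needs to be rebuilt, one possible sketch uses generate_series over the range of years found in deadalive1. The choice of that range is an assumption and may not match the original table.&lt;br /&gt;
 --sketch only: the year range is an assumption, taken from deadalive1&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT y AS year&lt;br /&gt;
 FROM generate_series(&lt;br /&gt;
  (SELECT MIN(aliveyear)::int FROM deadalive1),&lt;br /&gt;
  (SELECT MAX(deadyear)::int FROM deadalive1)) AS y;&lt;br /&gt;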
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
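Because alivecount is a pure GROUP BY over tempbase, the alive counts should add back up to the number of tempbase rows with a non-NULL coname. A quick sketch of that check, where the two columns should be equal:&lt;br /&gt;
 --sketch only: aliases are illustrative&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT SUM(numalive) FROM alivecount) AS sumalive,&lt;br /&gt;
  (SELECT COUNT(coname) FROM tempbase) AS tempbaserows;&lt;br /&gt;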
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
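To see why rows drop out between coleveloutput and colevelsimple, the missing values can be counted per column. This is a sketch; a row can be missing more than one value, so the three counts need not sum to the difference.&lt;br /&gt;
 --sketch only: counts NULLs in each column that colevelsimple filters on&lt;br /&gt;
 SELECT COUNT(*) AS total,&lt;br /&gt;
  COUNT(*) - COUNT(aliveyear) AS noaliveyear,&lt;br /&gt;
  COUNT(*) - COUNT(deadyear) AS nodeadyear,&lt;br /&gt;
  COUNT(*) - COUNT(latitude) AS nolatitude&lt;br /&gt;
 FROM coleveloutput;&lt;br /&gt;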
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
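The roundamtm CASE expression in the roundplus query picks the disclosed amount when present and the estimated amount otherwise, which can be written more compactly with COALESCE. The sketch below previews that form on roundcore. If the amount columns are stored as integers (an assumption worth checking), dividing by 1000 truncates, and dividing by 1000.0 would avoid that.&lt;br /&gt;
 --sketch only: preview of the COALESCE form, not a replacement for roundplus&lt;br /&gt;
 SELECT coname, rounddate, COALESCE(rndamtdisck, rndamtestk)/1000 AS roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LIMIT 10;&lt;br /&gt;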
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
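Note that a LEFT JOIN on city never matches rows where city is NULL, since NULL = NULL is not true in SQL, so such rows end up with a NULL numalive. The sketch below counts how many rows of roundleveloutput2 have no alive count.&lt;br /&gt;
 --sketch only: aliases are illustrative&lt;br /&gt;
 SELECT COUNT(*) AS total, COUNT(*) - COUNT(numalive) AS nonumalive&lt;br /&gt;
 FROM roundleveloutput2;&lt;br /&gt;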
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
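A quick way to confirm that (coname, rounddate) is now unique in roundcore is to compare the row count with the distinct key count. One caveat about SQL semantics: rows with a NULL rounddate never satisfy the equality test in the NOT EXISTS clause, so they would be kept even if duplicated. Whether such rows exist in round is not known here.&lt;br /&gt;
 --sketch only: the two columns should be equal if the key is unique&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM roundcore) AS numrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore) AS k) AS numkeys;&lt;br /&gt;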
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that they can be excluded from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
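To catch stage3 values that fall through every CASE branch (for example a spelling not listed above), the sketch below lists values that set none of the growth, transaction or exclude flags.&lt;br /&gt;
 --sketch only: stage3 values not covered by any flag&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;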
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
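The flag columns loaded below appear to have been added outside the database. A rough SQL equivalent of the box test, using the boundaries given above, is sketched here. It assumes longdirtyflag and latdirtyflag simply mark values outside the box, and it leaves out the Hawaii and Puerto Rico add-backs because their boundaries are not given.&lt;br /&gt;
 --sketch only: flag meanings are assumed; Hawaii and Puerto Rico are not handled&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
  CASE WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
  CASE WHEN latitude &amp;lt; 24 OR latitude &amp;gt; 50 THEN 1::int ELSE 0::int END AS latdirtyflag&lt;br /&gt;
 FROM geodirtydata AS g;&lt;br /&gt;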
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --40378 + 2300 = 42678, the geoimport row count&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
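Since geoimportclean filters on coname alone while the key is (coname, statecode, datefirstinv), it may be worth confirming that the combined table still has unique keys. A sketch of that check, which should return no rows:&lt;br /&gt;
 --sketch only: lists duplicated keys in geoimportfix&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*) AS numrows&lt;br /&gt;
 FROM geoimportfix&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;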
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this problem we will look up all the addresses in the db using the Geocode.py script. We also need to pull a company level file from SDC, because in the round level file the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that each company gets only one line and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Run this through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added flags on the geodata table to mark points outside the US. You can use the geoallcoords1 table instead of geoallcoords and keep only rows with excludeflag = 0 to filter out the 292 erroneous points when you create your colevel tables.&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 --IS NULL is needed for the missing-coordinate test because = NULL never evaluates to true&lt;br /&gt;
 WHEN longitude &lt; -125 OR longitude &gt; -66 OR latitude &lt; 24 OR latitude &gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
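As a sketch of how the flag could be applied, the query below rebuilds the company-level table against geoallcoords1 and keeps only coordinates with excludeflag = 0. The filter sits in the join condition rather than the WHERE clause so that a company with a flagged point is kept with NULL coordinates instead of being dropped. coleveloutput_usonly is a hypothetical table name.&lt;br /&gt;
 --sketch only: same joins as coleveloutput, but ignoring flagged coordinates&lt;br /&gt;
 CREATE TABLE coleveloutput_usonly AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.city, c.addr1, c.addr2, c.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2 AS c&lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname = c.coname AND d.statecode = c.statecode AND d.datefirstinv = c.datefirstinv&lt;br /&gt;
 LEFT JOIN geoallcoords1 AS g ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv AND g.excludeflag = 0&lt;br /&gt;
 WHERE hadgrowthvc = 1;&lt;br /&gt;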
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
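In the end we chose to use only firmname as the key, because there were few duplicates. We remove the duplicates by keeping the row with the earlier foundingdate: firmbaseinclude, despite its name, collects the later foundingdate for each duplicated firmname, and those rows are then excluded from firmbasecore.&lt;br /&gt;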
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
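An alternative to the duplicates/include/exclude steps, shown here only as a sketch: PostgreSQL's DISTINCT ON keeps one row per firmname, and the ORDER BY makes that row the one with the earliest foundingdate. firmbasecore_alt is a hypothetical table name, and the result has not been checked against the 14,133 rows above.&lt;br /&gt;
 --sketch only: one row per firmname, earliest foundingdate first&lt;br /&gt;
 CREATE TABLE firmbasecore_alt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY firmname, foundingdate ASC NULLS LAST;&lt;br /&gt;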
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The matching counts show that (fundname, firstinvdate) is a valid key. However, we are going to use fundname alone as the key because it makes later joins easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to collect all the duplicated fundnames and, for each one, keep only the row that has both the MIN(closedate) and the MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
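fundinclude holds 47 fundnames but fundinclude2 only has 44 rows, so some duplicated fundnames have no row that matches both minimums. That can happen when the two minimums fall on different rows, or when closedate or lastinvdate is NULL, because = never matches NULL. The query below is a sketch for listing the fundnames that were dropped this way so they can be reviewed by hand.&lt;br /&gt;
 --sketch only: duplicated fundnames with no surviving row in fundinclude2&lt;br /&gt;
 SELECT fu.fundname, fu.closedate, fu.lastinvdate&lt;br /&gt;
 FROM fundinclude AS fu&lt;br /&gt;
 LEFT JOIN fundinclude2 AS f ON f.fundname = fu.fundname&lt;br /&gt;
 WHERE f.fundname IS NULL;&lt;br /&gt;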
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the firm keys and the fund keys, including the firmname from the fundbasecore table with the fund keys. Run these two files through the Matcher, then manually flag the multiple matches (there are only ~50 of them) and reimport the result into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
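A sketch of that full join, assuming the comparison is done on distinct firmnames: rows where firmfirmname is NULL are firms that appear only in the funds table, and rows where fundsfirmname is NULL are firms with no fund.&lt;br /&gt;
 --sketch only: firmnames present on one side but not the other&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname&lt;br /&gt;
 FROM (SELECT DISTINCT firmname FROM firmbasecore) AS f&lt;br /&gt;
 FULL OUTER JOIN (SELECT DISTINCT firmname FROM fundbasecore WHERE firmname != 'Undisclosed Firm') AS fu&lt;br /&gt;
 ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL OR fu.firmname IS NULL;&lt;br /&gt;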
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data, so the RoundOnOneLine.pl and Normalize.pl scripts need to be fixed. After fixing them and reimporting roundline into vcdb2, there are still 1,963 funds that appear in roundline but not in fundbasecore.&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL)as r;&lt;br /&gt;
 --1963&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
    &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19091&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17128&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
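Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag, plus an exit value in millions (exitvaluem) that is taken from the IPO principal amount when there is an IPO and from the mas transaction amount otherwise. It is built off the companybaseipomasmaster table, so be sure you've built that first.&lt;br /&gt;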
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
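As a quick sanity check (a sketch, not part of the original build), the cross-tab below counts portcos by flag combination. exitflag should be 1 whenever ipoflag or maflag is 1, and numwithvalue shows how many rows in each group actually have an exit value.&lt;br /&gt;
 --sketch only: cross-tab of the exit flags&lt;br /&gt;
 SELECT ipoflag, maflag, exitflag, COUNT(*) AS numportcos, COUNT(exitvaluem) AS numwithvalue&lt;br /&gt;
 FROM portcoexitmaster&lt;br /&gt;
 GROUP BY ipoflag, maflag, exitflag&lt;br /&gt;
 ORDER BY ipoflag, maflag, exitflag;&lt;br /&gt;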
&lt;br /&gt;
==Creating portcoroundlinefirmmaster==&lt;br /&gt;
The portcoroundlinefirmmaster table joins the portco key (coname, statecode, datefirstinv) to roundline and, through fundbasecore, to the firm table.&lt;br /&gt;
 CREATE TABLE portcoroundlinefirmmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.rounddate, r.amountk, r.fundname, f.firmname&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv&lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname&lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname;&lt;br /&gt;
 --299321&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
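You can see that many keys do not exist in the other datasets: the join retains 8,956 of the 14,133 firms in firmbasecore, 16,907 of the 27,044 distinct funds in fundbasecore, and 42,093 of the 44,740 portco keys in companybasecore.&lt;br /&gt;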
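To see which portco keys fall out of the join, an anti-join along these lines could be used. This is a sketch that only reuses the tables built above.&lt;br /&gt;
 --sketch only: companybasecore keys with no row in the joined table&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;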
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Because we did another pull from SDC to get the correct city and addresses, we need to update the companybasecore table, which means cleaning the new companybase first. We then recreate the roundplus and roundlevel outputs from it.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,&lt;br /&gt;
 indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 DROP TABLE roundline1;&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19787</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19787"/>
		<updated>2017-08-04T21:56:12Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining funds, firms, roundline with companybasecore */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
The file containing all the SQL queries used to build vcdb2 is located in the Z drive and named ProcessData2.sql. &lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match: there are 44,774 rows in the companybase table but only 44,771 distinct keys, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
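As an alternative to exporting the table, the two rows can also be compared directly in psql. This is a sketch; \x on switches psql to expanded display so that wide rows are easier to read.&lt;br /&gt;
 --sketch only: show every column of the duplicated key&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = '2015-08-13';&lt;br /&gt;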
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate them; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad or Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
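As an alternative to writing each DELETE by hand, the remaining bad keys could probably be removed in one statement driven by ipoduplicates2. This is a sketch rather than the method used above: it matches on the full key including issuedate, so it may delete fewer rows than the manual statements, which match on issuer and statecode only.&lt;br /&gt;
 DELETE FROM ipoinclude AS i&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE i.issuer = d.issuer AND i.issuedate = d.issuedate AND i.statecode = d.statecode;&lt;br /&gt;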
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
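A more compact way to get the same one-row-per-key result in PostgreSQL might be DISTINCT ON, sketched below. It assumes the id column added to mas above and keeps the row with the lowest id for each key, which mirrors the MIN(id) approach. The table name mascore_alt is hypothetical and used only for illustration.&lt;br /&gt;
 CREATE TABLE mascore_alt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;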
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week after the minimum announceddate for that (targetname, targetstatecode) pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but no more than 1 week after it, the dateflag is 0.&lt;br /&gt;
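To see how the flag behaves at the boundaries, the same CASE expression can be run on a few made-up dates. The dates below are invented purely for illustration and all share a minimum of 2010-01-01. The four rows should be flagged 1, 0, 0, 1 in that order: the minimum itself is kept, dates 4 and 7 days later are dropped, and a date 8 days later is kept.&lt;br /&gt;
 SELECT d AS announceddate, m AS minanndate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; m OR d = m THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01', DATE '2010-01-01'), (DATE '2010-01-05', DATE '2010-01-01'),&lt;br /&gt;
 (DATE '2010-01-08', DATE '2010-01-01'), (DATE '2010-01-09', DATE '2010-01-01')) AS t(d, m);&lt;br /&gt;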
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is after the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join row count should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs. 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. They should be 1:1 on themselves, meaning each key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas and to ipo respectively.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
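The same arithmetic can be checked in one query instead of by hand. The sketch below assumes the table names used in this section. The column aliases lhs and rhs are only illustrative, and the two values should be equal (11005 with the counts above).&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeyipokeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS lhs,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcoipocore) AS rhs;&lt;br /&gt;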
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if they received an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes the exit deal which had the minimum date so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
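One more check that may be worth running is a direct search for duplicated portco keys in the master table. This is a sketch using the same GROUP BY / HAVING pattern as the earlier cleaning steps. If the joins stayed 1:1 it should return zero rows.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;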
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If an exclude flag is 1, then you want to exclude this entry from your table, i.e. the issuedate is not after the datefirstinv. If both flags are set to 0, then you want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and ready to use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 rows vs. 43651 distinct keys), spread over the 8 duplicated keys shown below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the Matcher you will notice that there are many matching errors, as usual. If you simply import the output into vcdb2 and join companybasecore with geocore through it, the join fans out: the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
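One possible source of the extra rows is a portco key that the Matcher paired with more than one geo key. This is a sketch of a check for that case, not part of the original build:&lt;br /&gt;
 --portco keys that appear more than once in the matcher output&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;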
We need to fix this. Luckily I already had this data in another database, so I copied it out and imported it into vcdb2. The raw data is in a text file called geolookupold.txt in the folder on the Z drive.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. The deaddate is the IPO issuedate if there is one, otherwise the announceddate of the acquisition; if there is no exit, the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
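The CASE expression above is equivalent to a COALESCE over the three dates in priority order. The sketch below checks that rule on three made-up rows (the dates are illustrative only and are not taken from the database):&lt;br /&gt;
 SELECT issuedate, announceddate, datelastinv,&lt;br /&gt;
 COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR') AS deaddate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2001-03-15', NULL::date, DATE '2000-06-01'),&lt;br /&gt;
  (NULL::date, DATE '2004-09-30', DATE '2002-01-10'),&lt;br /&gt;
  (NULL::date, NULL::date, DATE '2010-05-20')&lt;br /&gt;
 ) AS t(issuedate, announceddate, datelastinv);&lt;br /&gt;
 --the first row takes its issuedate, the second its announceddate, and the third has no exit so it gets datelastinv plus 5 years&lt;br /&gt;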
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
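The allyears table used in the join above is not defined on this page. A minimal sketch, assuming it is simply a one-column table of calendar years (the 1980 to 2017 range is an assumption; adjust it to the span of datefirstinv and deaddate in your data), to be created before tempbase:&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;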
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company and geo details, with the ipo/ma exit information summarized as aliveyear and deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
roundplus depends on the roundcore and stageflags tables, which are built in the Cleaning round table and Creating Stage Flags Table sections further down this page.&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
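 --the joins add a row (143001 here vs 143000 in roundcore), so look for duplicated (coname, rounddate) keys&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;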
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two round-level output tables below.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that it can be excluded from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE WHEN stage3 = 'Seed' THEN 1::int ELSE 0::int END AS seedflag,&lt;br /&gt;
 CASE WHEN stage3 = 'Early Stage' THEN 1::int ELSE 0::int END AS earlyflag,&lt;br /&gt;
 CASE WHEN stage3 = 'Later Stage' THEN 1::int ELSE 0::int END AS laterflag,&lt;br /&gt;
 CASE WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int ELSE 0::int END AS growthflag,&lt;br /&gt;
 CASE WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int ELSE 0::int END AS transactionflag,&lt;br /&gt;
 CASE WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout', 'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int ELSE 0::int END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
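As a sanity check, it can be useful to list any stage3 values that none of the growth, transaction or exclude flags cover. A sketch, assuming stageflags was built as above:&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;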
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box, then add back the points located in Hawaii and Puerto Rico. Finally, import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
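The flagging itself can also be done in SQL. The sketch below applies the box to geoimport; the flag names follow the geodirtydatawithflags table below, while the Hawaii and Puerto Rico boxes are rough assumptions that should be checked before use:&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude &amp;lt; 24 OR latitude &amp;gt; 50 THEN 1::int ELSE 0::int END AS latdirtyflag,&lt;br /&gt;
 --approximate box around Hawaii (assumed values)&lt;br /&gt;
 CASE WHEN latitude BETWEEN 18.5 AND 22.5 AND longitude BETWEEN -161 AND -154 THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 --approximate box around Puerto Rico (assumed values)&lt;br /&gt;
 CASE WHEN latitude BETWEEN 17.8 AND 18.6 AND longitude BETWEEN -67.5 AND -65 THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Under this sketch a point is dirty when longdirtyflag or latdirtyflag is 1 and neither hawaiiflag nor prflag is 1.&lt;br /&gt;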
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378 (of the 42678 rows in geoimport)&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo coordinates. To fix this problem we will look up all the addresses in the db using the Geocode.py script. We also need to pull a company-level file from SDC, because in the round-level pull the normalizer copies the addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that only one line is assigned to each company and there are no normalization errors. Then copy the file into the db and copy out all the distinct (coname, statecode, datefirstinv) keys that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added flags on the geo data to filter out points outside the US. You can use the geoallcoords1 table instead of geoallcoords and keep only the rows with excludeflag = 0 when you create your colevel tables; this filters out the 292 erroneous points that have excludeflag = 1.&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
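A sketch of how the colevel join changes when the flag is used, assuming geoallcoords1 was built as above. Flagged coordinates come through as NULL and are then dropped by the latitude IS NOT NULL filter in colevelsimple:&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.city, c.addr1, c.addr2, c.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2 AS c&lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname = c.coname AND d.statecode = c.statecode AND d.datefirstinv = c.datefirstinv&lt;br /&gt;
 --only unflagged coordinates are allowed to match&lt;br /&gt;
 LEFT JOIN geoallcoords1 AS g ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv AND g.excludeflag = 0&lt;br /&gt;
 WHERE c.hadgrowthvc = 1;&lt;br /&gt;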
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
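Instead we chose to use only firmname as the key because there were not too many duplicates. For each duplicated firmname we keep the row with the earlier foundingdate: firmbaseinclude below collects the row with the later foundingdate, and the anti-join that builds firmbasecore drops it.&lt;br /&gt;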
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
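PostgreSQL's DISTINCT ON offers a more compact way to keep one row per firmname. This is an alternative sketch, not the method used above, and firmbasecorealt is a hypothetical table name:&lt;br /&gt;
 CREATE TABLE firmbasecorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;
Because of the ORDER BY, the row kept for each firmname is the one with the earliest foundingdate; rows with a NULL foundingdate sort last.&lt;br /&gt;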
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that (fundname, firstinvdate) is a good key. However, we are going to use just fundname as the key because it makes the later join operations easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and, for each of them, include only the row with the MIN(closedate) and MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
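fundinclude2 has 44 rows for 47 duplicated fundnames, so some names drop out of fundbasecore entirely. A sketch of a check that lists them (a name drops out when no single row carries both the minimum closedate and the minimum lastinvdate, or when one of those dates is NULL and the equality join cannot match):&lt;br /&gt;
 SELECT e.fundname&lt;br /&gt;
 FROM fundnameexclude AS e&lt;br /&gt;
 LEFT JOIN fundinclude2 AS i ON i.fundname = e.fundname&lt;br /&gt;
 WHERE i.fundname IS NULL;&lt;br /&gt;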
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and the fund keys, and include the firmname from the fundbasecore table with the fund keys. Run these two files through the Matcher. Then manually flag the multiple matches (there are only ~50 of them) and reimport into vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
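A sketch of a query that lists those firms, using an anti-join from fundbasecore to firmbasecore:&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;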
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are a lot of mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data, so the RoundOnOneLine.pl and Normalize.pl scripts need to be fixed. After fixing them and reimporting roundline into vcdb2, there are still 1,963 funds that appear in roundline but not in fundbasecore.&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL)as r;&lt;br /&gt;
 --1963&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
    &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19091&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17128&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
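A quick cross-tabulation of the flags shows how many companies have an IPO, an acquisition, both or neither. This is a sketch of a check, assuming portcoexitmaster was built as above:&lt;br /&gt;
 SELECT ipoflag, maflag, COUNT(*) AS numcos&lt;br /&gt;
 FROM portcoexitmaster&lt;br /&gt;
 GROUP BY ipoflag, maflag&lt;br /&gt;
 ORDER BY ipoflag, maflag;&lt;br /&gt;
For companies with both flags set, exitvaluem is the IPO value, because the CASE in portcoexitmaster tests ipoissuedate first.&lt;br /&gt;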
&lt;br /&gt;
==Creating portcokey, roundline, firm table==&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
Only 42,093 distinct portco keys survive the joins, so many keys do not have a match in the other datasets.&lt;br /&gt;
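To list the portco keys that drop out of the join, an anti-join along the following lines may help. This is a sketch and not part of the original build; the aliases c and g are arbitrary.&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;
 --returns the companybasecore keys with no matching row in the joined table&lt;br /&gt;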
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull in SDC to get the correct city and addresses, we need to update the companybasecore table, which means we need to clean the new companybase first. We then recreate the roundplus and roundlevel outputs.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,&lt;br /&gt;
 indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 DROP TABLE roundline1;&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19786</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19786"/>
		<updated>2017-08-04T21:54:58Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining funds with roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
The file containing all the SQL queries used to build vcdb2 is located in the Z drive and named ProcessData2.sql. &lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of rows copied into your new tables matches the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to identify the record to delete. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
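The differing field can also be found without exporting the table, by inspecting the two rows for the duplicate key directly in psql before running the delete. This is a sketch that reuses the key found above; \x is psql's expanded-display toggle, which makes wide rows easier to compare.&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT * FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;
 \x off&lt;br /&gt;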
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
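The individual DELETE statements above could also be generated from ipoduplicates2 rather than typed by hand. The query below is a sketch under that assumption: it emits one statement per duplicated issuer and statecode pair using PostgreSQL's format() with %L for literal quoting, and the output should be reviewed before it is run.&lt;br /&gt;
 SELECT DISTINCT format('DELETE FROM ipoinclude WHERE issuer = %L AND statecode = %L;', issuer, statecode)&lt;br /&gt;
 FROM ipoduplicates2;&lt;br /&gt;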
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 --join on mas1 because masinclude.id was taken from mas1&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
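The same de-duplication can be sketched in a single query with a window function instead of the masinclude helper table. This is an alternative to the MIN(id) approach above, not the method used to build vcdb2, and the table name mascorealt is hypothetical.&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
  SELECT mas1.*, ROW_NUMBER() OVER (PARTITION BY targetname, targetstatecode, announceddate ORDER BY id) AS rn&lt;br /&gt;
  FROM mas1&lt;br /&gt;
 ) AS t&lt;br /&gt;
 WHERE rn = 1;&lt;br /&gt;
 --keeps the row with the lowest id for each key; the extra rn column is always 1&lt;br /&gt;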
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is within 1 week after the minimum announceddate for the pair, the flag is 0.&lt;br /&gt;
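A small worked example of the flag logic, using made-up dates and a minimum announceddate of 2010-01-01 (illustrative values, not rows from mas):&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int ELSE 0::int END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d);&lt;br /&gt;
 --2010-01-01 -&amp;gt; 1 (the minimum date itself)&lt;br /&gt;
 --2010-01-05 -&amp;gt; 0 (within a week of the minimum)&lt;br /&gt;
 --2010-01-08 -&amp;gt; 0 (exactly 7 days later; the comparison is strict)&lt;br /&gt;
 --2010-01-09 -&amp;gt; 1 (more than 7 days later)&lt;br /&gt;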
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
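The flags added in Excel can be sanity-checked after import. The query below is a sketch that assumes excludeflag1 marks an announceddate before the datefirstinv, excludeflag2 marks equal dates, and warningflag marks the &amp;quot;Hall-Warning:Multiple&amp;quot; warning; that mapping follows the order of the description above and is an assumption. It should return 0 if the flags agree with the underlying columns and the date columns are not NULL.&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM (file2announceddate &amp;lt; file1datefirstinv)::int&lt;br /&gt;
 OR excludeflag2 IS DISTINCT FROM (file2announceddate = file1datefirstinv)::int&lt;br /&gt;
 OR warningflag IS DISTINCT FROM (warning IS NOT DISTINCT FROM 'Hall-Warning:Multiple')::int;&lt;br /&gt;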
You've imported 9,645 matches into your matcher table in vcdb2, but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is not before the datefirstinv and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same row count as the matcherwarningmindates table (364), but it returns 366. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
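Note that if a company's announceddate and ipoissuedate fall on the same day, both flags above come out as 1 and that company would land in both core tables. A quick check, offered as a sketch; it should return 0 when there are no such ties:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;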
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above (8610 and 2312) to the count from the query below (83, the portcokeys that matched both an ipokey and a maskey) and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
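The same comparison can be done in a single query. This is a sketch that assumes the table names used in this section; the column aliases keptplusboth and matchedtotal are made up for illustration. The two columns should be equal, and with the counts reported here both would be 11005:&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
     WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusboth,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchedtotal;&lt;br /&gt;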
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next, export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two flags in Excel: one that marks rows where the issuedate &amp;lt; datefirstinv and one that marks rows where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude this entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next, copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
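The three flags could also be computed in SQL instead of Excel. The sketch below assumes a hypothetical table matcherportcoiporaw that holds only the first seven columns of the matcher output, and it assumes that excludeflag1 marks issuedate &amp;lt; datefirstinv and excludeflag2 marks issuedate = datefirstinv. Neither the table nor that ordering of the flags is stated on this page:&lt;br /&gt;
 --sketch only: matcherportcoiporaw is a hypothetical table with the raw matcher columns&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2issuedate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2issuedate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;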
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched more than once.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows versus 43651 keys), spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
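To see the full rows behind the remaining duplicated keys, including their coordinates, a query like the following can help. It is a sketch that joins geo1 back to its own list of duplicated keys:&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;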
Now manually check the longitude and latitude of each of these rows and delete one row for each duplicated key. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key. Note that some of the geo coordinates in the db are incorrect. This was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
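The tempbase query above joins against an allyears table that is not built anywhere in this section. A minimal sketch of one way to create it, assuming it is simply one row per calendar year in an integer column named year. The bounds 1980 and 2017 are placeholders, not values from this page, and should be set to cover the datefirstinv and deadyear range in your data. The table would need to exist before tempbase is built:&lt;br /&gt;
 --sketch only: 1980 and 2017 are placeholder bounds&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;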
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks rounds that financed activities we are not interested in, so that they can be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
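A quick way to check the classification is to list any stage3 values that none of the growth, transaction or exclude flags picks up. This is a sketch, not part of the original build; it assumes stageflags was created as above.&lt;br /&gt;
 --stage3 values not covered by any flag, most frequent first&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;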
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this problem we will just look up all the addresses in the db using the Geocode.py script. We also need to pull a company level file from SDC, because otherwise the normalizer will either copy the addresses down or leave them null. Modify your round ssh sdc script to remove the round dates, so that each company is assigned only one line. There will be no normalization errors this way. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run these through the Geocode.py script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added flags on the geodata table to filter out points outside the US bounding box. When you create your colevel tables, you can use the geoallcoords1 table instead of geoallcoords and keep only the rows with excludeflag = 0, which drops the 292 flagged points.&lt;br /&gt;
 --NULL must be tested with IS NULL; "= NULL" never evaluates to true, so the counts below were recorded without NULL coordinates being flagged&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
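The box above also flags points in Hawaii and Puerto Rico, which the earlier fix added back. A minimal sketch of a count that leaves those two out is shown below; it assumes that statecode holds the postal abbreviations 'HI' and 'PR', which has not been checked against the data.&lt;br /&gt;
 --points outside the box, not counting Hawaii and Puerto Rico (statecode values are an assumption)&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords&lt;br /&gt;
 WHERE (longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50)&lt;br /&gt;
 AND COALESCE(statecode, '') NOT IN ('HI', 'PR');&lt;br /&gt;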
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
In the end we chose to use only firmname as the key because there were not many duplicates. For each duplicated firmname we drop the row with the greater foundingdate (collected in firmbaseinclude below), which keeps the row with the lesser foundingdate.&lt;br /&gt;
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
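PostgreSQL's DISTINCT ON offers a more direct way to keep one row per firmname. The sketch below is an alternative, not the query used to build firmbasecore, and firmbasecorealt is a hypothetical table name. With ORDER BY firmname, foundingdate it keeps the row with the lesser foundingdate, and rows with a NULL foundingdate sort last. Unlike the anti-join above, it always keeps exactly one row per firmname, even when duplicates share the same foundingdate.&lt;br /&gt;
 --alternative sketch: one row per firmname, earliest foundingdate first&lt;br /&gt;
 CREATE TABLE firmbasecorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;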
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
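Note that fundinclude2 holds 44 rows for 47 duplicated fundnames, so at least three names lose all of their rows. The join requires a single row to carry both the MIN(closedate) and the MIN(lastinvdate), and an equality join never matches a NULL date. A sketch of a query that lists the dropped fundnames for manual review:&lt;br /&gt;
 --duplicated fundnames with no surviving row in fundinclude2&lt;br /&gt;
 SELECT fi.fundname, fi.closedate, fi.lastinvdate&lt;br /&gt;
 FROM fundinclude AS fi&lt;br /&gt;
 LEFT JOIN fundinclude2 AS f2 ON f2.fundname = fi.fundname&lt;br /&gt;
 WHERE f2.fundname IS NULL;&lt;br /&gt;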
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firms and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
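A sketch of an anti-join that lists those firms, assuming firmbasecore and fundbasecore as built above:&lt;br /&gt;
 --firm names that appear in fundbasecore but not in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;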
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are a lot of mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this. After fixing the scripts and reimporting roundline into vcdb2, there are still 1,963 funds that appear in roundline but not in fundbasecore.&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL)as r;&lt;br /&gt;
 --1963&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
    &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19091&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17128&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
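As a sanity check, the flags can be summed to see how many portfolio companies have an IPO, a merger or acquisition, or both. This is a sketch only. Note that when both dates are present, the CASE above takes the IPO amount as exitvaluem.&lt;br /&gt;
 --counts of companies by exit type; numboth counts companies with both an IPO and an M and A date&lt;br /&gt;
 SELECT SUM(ipoflag) AS numipos, SUM(maflag) AS nummas, SUM(ipoflag*maflag) AS numboth, SUM(exitflag) AS numexits&lt;br /&gt;
 FROM portcoexitmaster;&lt;br /&gt;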
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
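To see how many companybasecore keys are lost in the join, an anti-join against the distinct keys of the joined table can be used. A sketch, assuming both tables exist as built above:&lt;br /&gt;
 --companybasecore keys with no row in fundbasefirmbaseroundlinegoodkeys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;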
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull in SDC to get the correct city and addresses, we need to update the companybasecore table, which means we need to clean the new companybase. We then use it to recreate the roundplus and roundlevel outputs.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
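Because roundplus is built with LEFT JOINs only, it can never have fewer rows than roundcore, and it has more only if one of the joined tables has duplicate keys. A sketch of a check that compares the two counts:&lt;br /&gt;
 --the two counts should match if none of the joined tables has duplicate keys&lt;br /&gt;
 SELECT (SELECT COUNT(*) FROM roundcore) AS roundcorerows, (SELECT COUNT(*) FROM roundplus) AS roundplusrows;&lt;br /&gt;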
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 DROP TABLE roundline1;&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19785</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19785"/>
		<updated>2017-08-04T21:46:32Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
The file containing all the SQL queries used to build vcdb2 is located in the Z drive and named ProcessData2.sql. &lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by adding flags for Undisclosed Company and US location. Check if the key (coname, statecode, datefirstinv) is valid. Remove duplicates from roundbase manually or with an update command. Check if the round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of rows copied into each new table matches the row count recorded in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so a key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in Textpad for something unique about one of the entries for New York Digital Health LLC that we can use to manually delete it. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
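Before running a manual DELETE like this, it can help to look at the candidate rows side by side and to wrap the delete in a transaction so it can be undone. The following is a minimal sketch, not part of ProcessData2.sql. It assumes roundbase still holds both versions of the company, and note that roundbase may hold several round rows per company:&lt;br /&gt;
 --inspect the candidate rows and their urls first&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, url&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY';&lt;br /&gt;
&lt;br /&gt;
 BEGIN;&lt;br /&gt;
 --run the DELETE statement shown above here, then check what is left for the key&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY';&lt;br /&gt;
 --replace ROLLBACK with COMMIT once the remaining count is what you expect&lt;br /&gt;
 ROLLBACK;&lt;br /&gt;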
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and undisclosed company that we built in the companybase1 table. In this table each key (coname, statecode, datefirstinv) is meant to identify exactly one row, and the two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
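The two recheck queries can also be folded into one. This is a sketch only, not part of ProcessData2.sql. It relies on PostgreSQL accepting a row constructor inside COUNT(DISTINCT ...), and the column aliases allrows and distinctkeys are made up for illustration:&lt;br /&gt;
 SELECT COUNT(*) AS allrows,&lt;br /&gt;
  COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --the table is 1:1 on its key when both numbers are equal&lt;br /&gt;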
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys, and this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. The 939 duplicated keys cover 1888 rows of the ipos table (939 keys plus the 10440 - 9491 = 949 surplus rows), so rather than excluding all of those rows we would like to find some other way to filter out records that still leaves distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicated keys have different principalamt values. For each key, let's keep the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
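Writing one DELETE per issuer by hand gets tedious. As a hedged alternative (not what was run for vcdb2), the keys listed in ipoduplicates2 could be removed from ipoinclude with a single DELETE ... USING statement. Note that this sketch also matches on issuedate, so it removes only the duplicated key itself, whereas the statements above remove every ipoinclude row for the issuer and statecode:&lt;br /&gt;
 DELETE FROM ipoinclude&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE ipoinclude.issuer = d.issuer&lt;br /&gt;
  AND ipoinclude.issuedate = d.issuedate&lt;br /&gt;
  AND ipoinclude.statecode = d.statecode;&lt;br /&gt;
 --keys with a NULL issuer, issuedate or statecode are not matched by = and would need IS NOT DISTINCT FROM&lt;br /&gt;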
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
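 --masinclude.id comes from mas1, so join on mas1; SERIAL ids added separately to mas are not guaranteed to line up with it&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;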
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
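PostgreSQL's DISTINCT ON offers another way to keep one row per key without a separate include table. This is an alternative sketch, not the method used above. The table name mascore_sketch is hypothetical, and the query assumes the id column that was added to mas1:&lt;br /&gt;
 CREATE TABLE mascore_sketch AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;
 --for each key the row with the lowest id is kept, which mirrors the MIN(id) rule&lt;br /&gt;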
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher on them will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum but by no more than 7 days, the dateflag is 0.    &lt;br /&gt;
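To see how the flag behaves, here is a small self-contained sketch with made-up dates, taking 2010-01-01 as the minimum announceddate for a hypothetical target:&lt;br /&gt;
 SELECT d.announceddate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d.announceddate = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS d(announceddate);&lt;br /&gt;
 --2010-01-01 gets 1 (it is the minimum), 2010-01-05 and 2010-01-08 get 0, 2010-01-09 gets 1&lt;br /&gt;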
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add one exclude flag that is 1 if announceddate &amp;lt; datefirstinv and a second exclude flag that is 1 if datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating the matcherportcomas table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
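The flags added in Excel can be cross-checked in SQL once the file is imported. This sketch assumes excludeflag1 marks announceddate &amp;lt; datefirstinv and excludeflag2 marks equal dates (the text above does not say which flag is which). The check1, check2 and checkwarning aliases are made up:&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT *,&lt;br /&gt;
  CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END AS check1,&lt;br /&gt;
  CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END AS check2,&lt;br /&gt;
  CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END AS checkwarning&lt;br /&gt;
  FROM matcherportcomas)a&lt;br /&gt;
 WHERE excludeflag1 != check1 OR excludeflag2 != check2 OR warningflag != checkwarning;&lt;br /&gt;
 --a result of 0 means the imported flags agree with the recomputed ones&lt;br /&gt;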
You've imported 9,645 matches into your matcher table in vcdb2. The query below counts the &amp;quot;good&amp;quot; matches: those that carry no warning and where the announceddate of the merger/acquisition is later than the datefirstinv (so the two dates are not equal either).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join should equal the row count of the matcherwarningmindates table (364), but it doesn't. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. Each should be 1:1 on itself, meaning 1 key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from each companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
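In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE above could likely be collapsed to a single LEAST(announceddate, ipoissuedate). A quick sketch with made-up dates to confirm the behaviour on your server:&lt;br /&gt;
 SELECT LEAST(DATE '2012-03-01', DATE '2011-06-15') AS bothdates,&lt;br /&gt;
  LEAST(DATE '2012-03-01', NULL) AS masonly,&lt;br /&gt;
  LEAST(NULL, DATE '2011-06-15') AS ipoonly,&lt;br /&gt;
  LEAST(NULL::date, NULL::date) AS neither;&lt;br /&gt;
 --bothdates is 2011-06-15, masonly is 2012-03-01, ipoonly is 2011-06-15 and neither is NULL&lt;br /&gt;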
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
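The same arithmetic can be left to the database. A sketch, assuming all five tables from the previous steps exist; keyside and matcherside are made-up column names:&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keyside,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcoipocore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcomascore) AS matcherside;&lt;br /&gt;
 --with the counts reported on this page both columns come to 11005&lt;br /&gt;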
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, you want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the two counts agree, so the matcherportcoipocore table is clean and ready to use.&lt;br /&gt;
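The minimum-issuedate rule used above can also be sketched outside the database. The Python below is only an illustration, not part of the vcdb2 build: the function name keep_earliest_issue, the tuple layout and the sample company names are all made up for the example.&lt;br /&gt;

```python
from datetime import date

def keep_earliest_issue(matches):
    """Keep one match per portco key: the one with the earliest issue date.

    `matches` is an iterable of (coname, statecode, datefirstinv, issuedate)
    tuples; this row layout is an assumption made for the sketch.
    """
    best = {}
    for coname, statecode, datefirstinv, issuedate in matches:
        key = (coname, statecode, datefirstinv)
        # Replace the stored row only when this issue date is strictly earlier.
        if key not in best or best[key][3] > issuedate:
            best[key] = (coname, statecode, datefirstinv, issuedate)
    return sorted(best.values())

# Made-up sample rows: the first company is matched to two IPO dates.
sample = [
    ("Example Co", "CA", date(1999, 1, 5), date(2004, 6, 1)),
    ("Example Co", "CA", date(1999, 1, 5), date(2002, 3, 15)),
    ("Sample Inc", "TX", date(2001, 7, 9), date(2006, 2, 2)),
]
deduped = keep_earliest_issue(sample)
```

Unlike the SQL join on MIN(file2issuedate), this sketch keeps exactly one row even when two matches share the same earliest date. The SQL version would keep both, so the distinct-key check is still worth running after the join.&lt;br /&gt;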
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys, you can see that most of them are simply complete duplicates, so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 rows against 43651 distinct keys), spread over the 8 keys shown below that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one row from each duplicate pair. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
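The duplicate-key check is repeated for almost every table on this page, so it can be handy to wrap it in a helper. This is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for PostgreSQL; the function name duplicate_keys and the sample rows (including the coordinates) are hypothetical.&lt;br /&gt;

```python
import sqlite3

def duplicate_keys(conn, table, key_cols):
    """Return (key..., count) rows for every key that occurs more than once."""
    cols = ", ".join(key_cols)
    # Table and column names are interpolated directly, so pass trusted names only.
    sql = (
        f"SELECT {cols}, COUNT(*) FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo (city TEXT, coname TEXT, startyear INTEGER, lattitude REAL)"
)
# Made-up rows: the first two share a key but differ in a non-key column.
conn.executemany(
    "INSERT INTO geo VALUES (?, ?, ?, ?)",
    [
        ("Springfield", "Example Co", 1999, 40.8),
        ("Springfield", "Example Co", 1999, 40.9),
        ("Riverton", "Sample Inc", 2010, 29.7),
    ],
)
dupes = duplicate_keys(conn, "geo", ["city", "coname", "startyear"])
```

An empty result means the key is valid. Rows like the pair above survive SELECT DISTINCT * because they differ outside the key, which is why the manual deletes were still needed.&lt;br /&gt;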
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geo keys and companybase keys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table to build a comparable key. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import the matcher output into vcdb2 and try joining companybasecore with geocore through it, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018 in the join below, which is 278 more rows than companybasecore contains.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
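The source only reports the row counts, but a likely cause of the jump from 44,740 to 45,018 (an assumption) is that some portco keys are matched to more than one geo key, so the LEFT JOIN repeats those companies. The sketch below reproduces that effect on made-up data, with sqlite3 standing in for PostgreSQL; the table and column names are simplified for the example.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (coname TEXT, city TEXT);
    CREATE TABLE matcher (portcoconame TEXT, geoconame TEXT);
    INSERT INTO company VALUES ('Example Co', 'Springfield'), ('Sample Inc', 'Riverton');
    -- 'Example Co' is matched to two geo names, so it appears twice after the join.
    INSERT INTO matcher VALUES ('Example Co', 'Example Co'),
                               ('Example Co', 'Example Company'),
                               ('Sample Inc', 'Sample Inc');
""")

def count(sql):
    """Run a COUNT(*) query and return the single integer result."""
    return conn.execute(sql).fetchone()[0]

base_rows = count("SELECT COUNT(*) FROM company")
joined_rows = count(
    "SELECT COUNT(*) FROM company AS c "
    "LEFT JOIN matcher AS m ON m.portcoconame = c.coname"
)
# Any positive difference means the join is duplicating base rows.
extra_rows = joined_rows - base_rows
```

Comparing the row count before and after each join, as the count comments on this page do, is the quickest way to catch this. The join against geoimport keeps the count at 44740 because its key was verified to be distinct first.&lt;br /&gt;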
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on using the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect, which I found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future, the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: if the company had an IPO, the deaddate is the issuedate; otherwise, if it was acquired, it is the announceddate; and if there is no exit, the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
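The precedence in the CASE expression above can be written as a small function, which makes the rule easy to test in isolation. This is a sketch only: the function names dead_date and add_years are hypothetical, and the leap-day handling assumes the PostgreSQL behaviour of mapping 29 Feb to 28 Feb when years are added.&lt;br /&gt;

```python
from datetime import date

def add_years(d, years):
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def dead_date(datelastinv, issuedate=None, announceddate=None):
    """IPO issue date if present, else acquisition date, else last investment + 5 years."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    # A missing datelastinv gives None, just as the SQL would yield NULL.
    return None if datelastinv is None else add_years(datelastinv, 5)

# A company with no exit is treated as dead five years after its last round.
no_exit = dead_date(date(2010, 3, 1))
# When both an IPO and an acquisition are recorded, the IPO date wins.
both_exits = dead_date(date(2010, 3, 1), date(2012, 1, 1), date(2011, 1, 1))
```

Note that the IPO date takes precedence even when the acquisition was announced earlier, exactly as in the CASE expression.&lt;br /&gt;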
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
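The tempbase and alivecount steps expand each company into one row per year it was alive and then count companies per city and year. A rough Python equivalent is sketched below; the function name alive_counts and the sample rows are made up, and the sketch ignores the bounds that the allyears table imposes in the real query.&lt;br /&gt;

```python
from collections import Counter

def alive_counts(companies):
    """Count companies alive per (city, statecode, year).

    `companies` holds (coname, city, statecode, aliveyear, deadyear) tuples;
    a company counts as alive in every year from aliveyear to deadyear inclusive.
    """
    seen = set()
    for coname, city, statecode, aliveyear, deadyear in companies:
        for year in range(aliveyear, deadyear + 1):
            # The set plays the role of SELECT DISTINCT in tempbase.
            seen.add((year, coname, city, statecode))
    return Counter((city, statecode, year) for year, _, city, statecode in seen)

counts = alive_counts([
    ("Example Co", "Springfield", "TX", 2010, 2012),
    ("Sample Inc", "Springfield", "TX", 2011, 2011),
])
```

Here counts[("Springfield", "TX", 2011)] is 2 because both companies are alive in that year, while 2010 and 2012 each count only the first company.&lt;br /&gt;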
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in, such as 'Open Market Purchase', 'PIPE', etc., and thus should be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
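The same stage3-to-flag mapping can be expressed as a lookup, which is useful for spot-checking individual stage values. This is a sketch, assuming the stage3 strings listed in the query above; the function name stage_flags is hypothetical.&lt;br /&gt;

```python
GROWTH_STAGES = {"Seed", "Early Stage", "Later Stage"}
TRANSACTION_STAGES = {
    "Acq. for Expansion", "Acquisition", "Bridge Loan", "Expansion",
    "Pending Acq", "Recap or Turnaround", "Mezzanine",
}
EXCLUDE_STAGES = {
    "LBO", "MBO", "Open Market Purchase", "PIPE", "Secondary Buyout",
    "Other", "VC Partnership", "Secondary Purchase",
}

def stage_flags(stage3):
    """Return the six 0/1 flags for one stage3 value, mirroring the CASE expressions."""
    return {
        "seedflag": int(stage3 == "Seed"),
        "earlyflag": int(stage3 == "Early Stage"),
        "laterflag": int(stage3 == "Later Stage"),
        "growthflag": int(stage3 in GROWTH_STAGES),
        "transactionflag": int(stage3 in TRANSACTION_STAGES),
        "excludeflag": int(stage3 in EXCLUDE_STAGES),
    }

seed = stage_flags("Seed")
pipe = stage_flags("PIPE")
```

A stage3 value that appears in none of the lists, or a missing value, gets 0 for every flag, which matches the ELSE branches in the SQL.&lt;br /&gt;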
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this problem we will just look up all the addresses in the DB using the Geocode.py script. We also need to pull a company-level file from SDC, because the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates so that each company gets exactly one line; this way there will be no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added flags on the geodata table to identify points outside the US. You can use the geoallcoords1 table instead of geoallcoords and keep only the rows with excludeflag = 0 to filter out the 292 erroneous points when you create your colevel tables.&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
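 --missing coordinates must be tested with IS NULL; a comparison with = NULL is never true, so the count recorded below may rise if any coordinates are NULL&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;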
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
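The bounding-box test can be checked on individual points with a few lines of Python. This is a sketch using the boundaries quoted in this section (longitude -125 to -66, latitude 24 to 50); the function name exclude_flag is hypothetical and the sample points are rough approximations chosen for illustration. Missing coordinates are flagged as well, which appears to be the intent of the NULL tests in the query.&lt;br /&gt;

```python
LON_MIN, LON_MAX = -125.0, -66.0
LAT_MIN, LAT_MAX = 24.0, 50.0

def exclude_flag(latitude, longitude):
    """Return 1 for missing coordinates or points outside the box, else 0."""
    if latitude is None or longitude is None:
        return 1
    inside = (
        longitude >= LON_MIN and LON_MAX >= longitude
        and latitude >= LAT_MIN and LAT_MAX >= latitude
    )
    return 0 if inside else 1

central_us = exclude_flag(40.0, -100.0)   # roughly the middle of the continental US
hawaii = exclude_flag(21.3, -157.9)       # roughly Honolulu, outside the box
india = exclude_flag(28.6, 77.2)          # roughly New Delhi, outside the box
missing = exclude_flag(None, -100.0)      # no latitude available
```

Points in Hawaii and Puerto Rico fall outside this box even though they are valid, so they have to be added back separately, as described at the start of this section.&lt;br /&gt;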
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
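Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate. Note that, despite its name, firmbaseinclude holds the greater (MAX) foundingdate for each duplicated firmname; the anti-join in the final query drops those rows, so the earlier one survives.&lt;br /&gt;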
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
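Notice that fundinclude has 47 rows but fundinclude2 only 44. The source does not say why, but one possible mechanism (an assumption) is that MIN(closedate) and MIN(lastinvdate) are computed independently, so for some fundnames no single row carries both minima, or one of the dates is NULL and never matches in the equality join. The sketch below shows both cases on made-up data; the function name pick_joint_min is hypothetical.&lt;br /&gt;

```python
from datetime import date

def pick_joint_min(rows):
    """Per fundname, keep rows whose closedate and lastinvdate both equal the group minima.

    `rows` holds (fundname, closedate, lastinvdate) tuples. None stands in for
    SQL NULL and never matches, as in the SQL equality join.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    kept = []
    for name, members in sorted(groups.items()):
        closes = [r[1] for r in members if r[1] is not None]
        lasts = [r[2] for r in members if r[2] is not None]
        if not closes or not lasts:
            continue  # MIN() over only NULL values is NULL, so nothing can match
        target = (min(closes), min(lasts))
        kept.extend(r for r in members if (r[1], r[2]) == target)
    return kept

rows = [
    # Fund A: one row holds both minima, so it is kept.
    ("Fund A", date(2001, 1, 1), date(2005, 1, 1)),
    ("Fund A", date(2003, 1, 1), date(2008, 1, 1)),
    # Fund B: the minima sit in different rows, so no row qualifies.
    ("Fund B", date(2001, 1, 1), date(2009, 1, 1)),
    ("Fund B", date(2004, 1, 1), date(2006, 1, 1)),
    # Fund C: closedate is missing everywhere, so no row qualifies.
    ("Fund C", None, date(2005, 1, 1)),
    ("Fund C", None, date(2007, 1, 1)),
]
kept = pick_joint_min(rows)
```

Fund names that drop out this way are missing from fundbasecore entirely, so it is worth listing the names in fundinclude that have no row in fundinclude2 and deciding by hand which row to keep.&lt;br /&gt;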
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firms and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
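The firm names behind that count can be listed with an anti-join. This is a minimal sketch that uses only the tables and columns already shown above; it was not part of the original build script.&lt;br /&gt;
 --firmnames that appear in fundbasecore but have no match in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;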
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to correct the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
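A quick sanity check on the flags might look like the sketch below (not part of the original script). Because exitflag is set whenever ipoflag or maflag is set, numexits should equal numipos + nummas - numboth. The output column names are only illustrative.&lt;br /&gt;
 SELECT SUM(ipoflag) AS numipos, SUM(maflag) AS nummas, SUM(exitflag) AS numexits,&lt;br /&gt;
 SUM(CASE WHEN ipoflag = 1 AND maflag = 1 THEN 1 ELSE 0 END) AS numboth&lt;br /&gt;
 FROM portcoexitmaster;&lt;br /&gt;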
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys:&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull from SDC to get the correct cities and addresses, we need to update the companybasecore table, which means cleaning the new companybase. We then recreate the roundplus and roundlevel outputs.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,&lt;br /&gt;
 indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 DROP TABLE roundline1;&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19784</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19784"/>
		<updated>2017-08-04T21:30:33Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* File locations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
The database can be opened by typing psql vcdb2.&lt;br /&gt;
The file containing all the SQL queries used to build vcdb2 is on the Z drive and is named ProcessData2.sql. &lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by adding flags for Undisclosed Company and US location. Check whether the key (coname, statecode, datefirstinv) is valid. Remove duplicates from roundbase manually or with an update command. Check whether the round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the number of lines given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, meaning there is a 1:1 relationship between keys and entries. Given an entry you will be able to create a unique key, and given a (coname, statecode, datefirstinv) key you will be able to find exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so a key can match more than one entry in the table.&lt;br /&gt;
Some of the rows in the companybase table have undisclosed company names or belong to companies outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to single it out for deletion. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
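The rows behind a duplicate key can also be compared directly in psql instead of exporting the table. A minimal sketch, assuming the duplicate key found above; \x is psql's expanded display toggle, which prints one column per line so that the differing field is easier to spot.&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT * FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;
 \x off&lt;br /&gt;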
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
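The two checks can also be combined so that the counts appear side by side. A small sketch (the output column names totalrows and distinctkeys are only illustrative):&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasecore) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore) AS k) AS distinctkeys;&lt;br /&gt;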
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries (1,888 rows in the ipos table in total), look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamts. For each key, let's keep the row with the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
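A different way to reach one row per key, shown here only as a sketch and not as the method used to build vcdb2, is to rank the rows within each key with a window function and keep the first. Unlike the manual DELETE statements above, which remove the affected issuers entirely, this keeps one arbitrarily chosen row when the principalamt ties. The table name ipocorealt and the helper column rn are hypothetical, and rn remains in the resulting table.&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT ipos.*,&lt;br /&gt;
  ROW_NUMBER() OVER (PARTITION BY issuer, issuedate, statecode ORDER BY principalamt DESC NULLS LAST) AS rn&lt;br /&gt;
  FROM ipos) AS ranked&lt;br /&gt;
 WHERE rn = 1;&lt;br /&gt;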
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column to mas and took the MIN id for each duplicate key.&lt;br /&gt;
 --Add the id to mas first so that mas1 inherits the same id values&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
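PostgreSQL's DISTINCT ON can do the same job in one statement by keeping the row with the lowest id for each key. This is a sketch of an alternative, not the query used for vcdb2; mascorealt is a hypothetical table name, and it assumes mas1 has the id column described above.&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;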
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate, or when it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum announceddate for the targetname, targetstatecode pair by 1 week or less, the dateflag is 0.    &lt;br /&gt;
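To see how the flag behaves, the same CASE expression can be run against a few made-up dates (the values below are invented purely for illustration):&lt;br /&gt;
 SELECT d.announceddate, d.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; d.minanndate OR d.announceddate = d.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-01', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-05', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-20', DATE '2010-01-01')&lt;br /&gt;
 ) AS d(announceddate, minanndate);&lt;br /&gt;
 --dateflag: 1 for 2010-01-01 (the minimum itself), 0 for 2010-01-05 (4 days later), 1 for 2010-01-20 (19 days later)&lt;br /&gt;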
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is not before the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
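To see where the excluded rows come from, the matches can be broken down by flag combination. A small sketch, not part of the original script:&lt;br /&gt;
 SELECT excludeflag1, excludeflag2, warningflag, COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 GROUP BY excludeflag1, excludeflag2, warningflag&lt;br /&gt;
 ORDER BY excludeflag1, excludeflag2, warningflag;&lt;br /&gt;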
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Look up these records in the matcher output in Excel, decide which match to drop for each company, and delete that entry from matcherportcomasinclude:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
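The reason the join returned 366 rows instead of 364 is presumably that two portco keys each have two targets sharing the same earliest announceddate, so both rows survive the join on the minimum date. The sketch below reproduces that behaviour on made-up data; the function names and rows are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict
from datetime import date


def earliest_matches(rows):
    """rows: (portco_key, target_name, announceddate) tuples.

    Returns the rows whose date equals the minimum date for their portco key,
    mirroring the MIN() table joined back to the matcher table.
    """
    min_date = {}
    for key, _target, announced in rows:
        if key not in min_date or min_date[key] > announced:
            min_date[key] = announced
    return [row for row in rows if row[2] == min_date[row[0]]]


def tied_keys(kept_rows):
    """Portco keys that still appear more than once after the join."""
    counts = defaultdict(int)
    for key, _target, _announced in kept_rows:
        counts[key] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical data: company A has two targets on the same earliest date
rows = [
    ("A", "A Corp", date(2008, 1, 10)),
    ("A", "A Holdings", date(2008, 1, 10)),
    ("A", "A Later", date(2011, 5, 2)),
    ("B", "B Inc", date(2009, 7, 1)),
    ("B", "B Group", date(2012, 3, 3)),
]
kept = earliest_matches(rows)
print(len(kept), tied_keys(kept))
```

Keys reported by tied_keys are the ones that need a manual decision, as with the two DELETE statements above.&lt;br /&gt;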
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. Each portco key from the companybase table should match exactly 1 mas key from the mascore table. If any portco key matches more than one mas key, you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. Each should be 1:1 on its own key, meaning 1 key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open the file in Excel and add a flag marking the rows that have both an ipokey and a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match both an ipokey and a maskey. We'll write a SQL query that keeps whichever of the ipokey or maskey has the earliest date attached to it. That date will be the exit date for the portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Finally we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas and to ipo respectively.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
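The two queries above amount to a small per-row rule: take the earlier of the two dates as the exit date, then mark which key produced it. A minimal sketch of that rule follows, with None standing in for SQL NULL; the function name and sample dates are hypothetical.&lt;br /&gt;

```python
from datetime import date


def exit_flags(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid) for one portco row."""
    # Same result as the CASE / LEAST expression: the earliest non-missing date
    candidates = [d for d in (announceddate, ipoissuedate) if d is not None]
    masterdate = min(candidates) if candidates else None
    # A key is valid only if it exists and supplied the exit date
    maskeyvalid = 1 if announceddate is not None and announceddate == masterdate else 0
    ipokeyvalid = 1 if ipoissuedate is not None and ipoissuedate == masterdate else 0
    return masterdate, maskeyvalid, ipokeyvalid


# Hypothetical rows: IPO before acquisition, acquisition only, no exit
print(exit_flags(date(2010, 5, 1), date(2008, 2, 1)))
print(exit_flags(date(2010, 5, 1), None))
print(exit_flags(None, None))
```

Note that if both dates are equal, both flags come out as 1, in this sketch and in the SQL, so such a portco would appear in both key tables.&lt;br /&gt;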
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total with the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 8655 + 2350.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
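The arithmetic behind this check can be spelled out. The sketch below uses only the counts reported in the query comments on this page; the variable names are mine, and the reasoning assumes no portco has identical ipo and mas dates.&lt;br /&gt;

```python
# Counts reported in the comments of the queries on this page
mas_kept = 8610       # rows in companybasekeymaskeycore
ipo_kept = 2312       # rows in companybasekeyipokeycore
both = 83             # portcos matched to both an ipo key and a mas key
mas_matched = 8655    # rows in matcherportcomascore
ipo_matched = 2350    # rows in matcherportcoipocore

# Each doubly matched portco keeps exactly one of its two keys (assuming no
# portco has identical ipo and mas dates), so one key per such portco is dropped.
dropped = (mas_matched - mas_kept) + (ipo_matched - ipo_kept)
balanced = mas_kept + ipo_kept + both == mas_matched + ipo_matched
print(dropped, balanced)
```

If the number of dropped keys differs from the number of doubly matched portcos, some key was lost or duplicated in an earlier step.&lt;br /&gt;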
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the key tables keep only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add one exclude flag that is 1 where the issuedate &amp;lt; datefirstinv and a second exclude flag that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 vs. 43651), spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
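The listing above can be reconciled with the row counts: each duplicated key contributes one row fewer than its count to the surplus. The short sketch below does that arithmetic with the numbers shown on this page; the variable names are mine.&lt;br /&gt;

```python
# count column from the duplicate-key listing above
duplicate_counts = [2, 2, 2, 2, 4, 2, 2, 3]

duplicated_keys = len(duplicate_counts)
# Every duplicated key contributes (count - 1) rows beyond its first row
excess_rows = sum(n - 1 for n in duplicate_counts)

rows_in_geo1 = 43662      # SELECT DISTINCT * FROM geo
distinct_keys = 43651     # distinct (city, coname, startyear) in geo

print(duplicated_keys, excess_rows, rows_in_geo1 - distinct_keys)
```

The surplus computed from the listing matches the difference between the two table counts, so the listing accounts for every remaining duplicate.&lt;br /&gt;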
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on using the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
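The deaddate rule in the CASE expression above can be written as a small function. This is a sketch of the same rule with None standing in for SQL NULL; the function names are hypothetical, and it returns a plain date where the SQL interval arithmetic yields a timestamp.&lt;br /&gt;

```python
from datetime import date


def add_years(d, years):
    """Add whole years; Feb 29 falls back to Feb 28 in a non-leap target year."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def dead_date(datelastinv, issuedate, announceddate):
    """Mirror the CASE expression above for one company row."""
    if issuedate is None and announceddate is None:
        # No exit: dead five years after the last investment
        if datelastinv is None:
            return None
        return add_years(datelastinv, 5)
    if issuedate is not None:
        return issuedate          # the IPO date is checked first
    return announceddate          # otherwise the M&A announcement date


# Hypothetical rows: no exit, IPO exit, M&A exit
print(dead_date(date(2010, 6, 30), None, None))
print(dead_date(date(2010, 6, 30), date(2012, 4, 2), None))
print(dead_date(date(2010, 6, 30), None, date(2013, 9, 9)))
```

Because the key tables keep only the earlier of the two exits, at most one of issuedate and announceddate should be set for any company, so the order of the last two branches should not matter in practice.&lt;br /&gt;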
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
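The tempbase and alivecount queries expand each company into one row per year it was alive and then count companies per city, state and year. A minimal sketch of the same idea follows; it assumes the allyears table covers every year in each company's range, and the function name and sample companies are hypothetical.&lt;br /&gt;

```python
from collections import Counter


def alive_counts(companies):
    """companies: (coname, city, statecode, aliveyear, deadyear) tuples.

    Returns a Counter keyed by (city, statecode, year) giving the number of
    distinct company names alive in that place and year.
    """
    # The set plays the role of SELECT DISTINCT year, coname, city, statecode
    alive = set()
    for coname, city, statecode, aliveyear, deadyear in companies:
        for year in range(aliveyear, deadyear + 1):  # both end years inclusive
            alive.add((year, coname, city, statecode))
    return Counter((city, statecode, year) for year, _coname, city, statecode in alive)


# Hypothetical companies
counts = alive_counts([
    ("Acme", "Houston", "TX", 2000, 2002),
    ("Beta", "Houston", "TX", 2001, 2001),
])
for key in sorted(counts):
    print(key, counts[key])
```

A company counts as alive in both its first investment year and its dead year, matching the inclusive comparisons in the join condition.&lt;br /&gt;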
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
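The roundamtm rule used in roundplus can be sketched in Python as follows. This is an illustration only and is not part of ProcessData2.sql; the function name round_amount_m is hypothetical, and the sketch assumes (from the column suffixes) that rndamtdisck and rndamtestk are in thousands and roundamtm is in millions.&lt;br /&gt;

```python
def round_amount_m(rndamtdisck, rndamtestk):
    """Return the round amount in millions, or None if neither input is known.

    Mirrors the CASE expression in roundplus: the disclosed amount wins and
    the estimated amount is the fallback. Inputs are assumed to be thousands.
    """
    if rndamtdisck is not None:
        return rndamtdisck / 1000
    if rndamtestk is not None:
        return rndamtestk / 1000
    return None


# Hypothetical values, not taken from vcdb2.
examples = [
    round_amount_m(2500.0, 9000.0),  # disclosed amount present: estimate ignored
    round_amount_m(None, 9000.0),    # only the estimate is known
    round_amount_m(None, None),      # nothing known
]
```

A round with neither amount keeps a null roundamtm, so it contributes nothing to the amount sums but is still counted by the flag sums.&lt;br /&gt;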
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
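The exclusion step can be tried on a toy table. The sketch below uses Python's sqlite3 module with made-up rows (not vcdb2 data); the helper name build_roundcore is hypothetical.&lt;br /&gt;

```python
import sqlite3

# Made-up sample rows (coname, rounddate, stage3); not taken from vcdb2.
SAMPLE_ROUNDS = [
    ("Acme Inc", "2001-03-01", "Seed"),
    ("Acme Inc", "2001-03-01", "Early Stage"),  # same key twice: ambiguous
    ("Acme Inc", "2003-06-15", "Later Stage"),
    ("Beta LLC", "2005-01-10", "Seed"),
]


def build_roundcore(rows):
    """Return the rows whose (coname, rounddate) key occurs exactly once."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE round (coname TEXT, rounddate TEXT, stage3 TEXT)")
    con.executemany("INSERT INTO round VALUES (?, ?, ?)", rows)
    # Keys that occur more than once. A group always has at least one row,
    # so COUNT(*) != 1 selects exactly the repeated keys.
    con.execute(
        "CREATE TABLE roundexclude AS "
        "SELECT coname, rounddate FROM round "
        "GROUP BY coname, rounddate HAVING COUNT(*) != 1"
    )
    # Anti-join: keep only rounds whose key is absent from roundexclude.
    core = con.execute(
        "SELECT coname, rounddate, stage3 FROM round "
        "WHERE NOT EXISTS (SELECT 1 FROM roundexclude AS re "
        "WHERE re.coname = round.coname AND re.rounddate = round.rounddate) "
        "ORDER BY coname, rounddate"
    ).fetchall()
    con.close()
    return core


roundcore = build_roundcore(SAMPLE_ROUNDS)
```

Note that every row of a repeated key is dropped, not just the extra copy, which is the same behaviour as the NOT EXISTS query above.&lt;br /&gt;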
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The transactionflag marks acquisition, expansion, bridge loan, recap and mezzanine rounds. The excludeflag marks financing for activities we are not interested in, such as 'LBO', 'Open Market Purchase' and 'PIPE', so that these entries can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
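The flag logic can also be written as a small lookup, which makes the three stage groups easy to inspect. This Python sketch is an illustration only; the names stage_flags and the three set constants are hypothetical, and the stage3 values are copied from the CASE expressions above.&lt;br /&gt;

```python
# Stage groupings copied from the CASE expressions in stageflags.
GROWTH_STAGES = {"Seed", "Early Stage", "Later Stage"}
TRANSACTION_STAGES = {
    "Acq. for Expansion", "Acquisition", "Bridge Loan", "Expansion",
    "Pending Acq", "Recap or Turnaround", "Mezzanine",
}
EXCLUDE_STAGES = {
    "LBO", "MBO", "Open Market Purchase", "PIPE", "Secondary Buyout",
    "Other", "VC Partnership", "Secondary Purchase",
}


def stage_flags(stage3):
    """Return the six 0/1 flags for one stage3 value.

    None or an unlisted value gives all zeros, like the ELSE branches in SQL.
    """
    return {
        "seedflag": int(stage3 == "Seed"),
        "earlyflag": int(stage3 == "Early Stage"),
        "laterflag": int(stage3 == "Later Stage"),
        "growthflag": int(stage3 in GROWTH_STAGES),
        "transactionflag": int(stage3 in TRANSACTION_STAGES),
        "excludeflag": int(stage3 in EXCLUDE_STAGES),
    }


seed = stage_flags("Seed")
pipe = stage_flags("PIPE")
```

Because the three groups do not overlap, at most one of growthflag, transactionflag and excludeflag is 1 for any round.&lt;br /&gt;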
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations outside the US, such as India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50 (decimal degrees).  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this we look up all the addresses in the db using the Geocode.py script. We also need to pull a company level file from SDC, because in the round level pull the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that each company is assigned exactly one line and no normalization errors occur. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run these through the Geocode.py script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added a flag on the geodata table to filter points outside the US. When you create your colevel tables you can use the geoallcoords1 table instead of geoallcoords and keep only the rows with excludeflag = 0, which filters out the 292 erroneous points that have excludeflag = 1.&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
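 --use IS NULL here: a comparison written as = NULL is never true, so missing coordinates would not be flagged&lt;br /&gt;
 WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;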
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
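The bounding-box test can be sketched in Python as below. This is an illustration under stated assumptions, not the Geocode.py script: the names in_range and exclude_flag are hypothetical, the box limits are the ones given earlier, and a missing coordinate is assumed to be excluded as well.&lt;br /&gt;

```python
# Box limits stated earlier (decimal degrees); both ends are treated as inside.
LON_MIN, LON_MAX = -125.0, -66.0
LAT_MIN, LAT_MAX = 24.0, 50.0


def in_range(value, low, high):
    """True when value lies between low and high (inclusive).

    Clamping value into the interval leaves it unchanged only if it was
    already inside the interval.
    """
    return max(low, min(value, high)) == value


def exclude_flag(latitude, longitude):
    """Return 1 for a missing or out-of-box point, 0 for a point inside the box."""
    if latitude is None or longitude is None:
        return 1  # assumption: a missing coordinate should be excluded too
    inside = in_range(latitude, LAT_MIN, LAT_MAX) and in_range(longitude, LON_MIN, LON_MAX)
    return 0 if inside else 1


# Approximate city coordinates, for illustration only.
flags = {
    "Houston": exclude_flag(29.76, -95.37),
    "New Delhi": exclude_flag(28.61, 77.21),
    "Honolulu": exclude_flag(21.31, -157.86),
    "missing": exclude_flag(None, -95.37),
}
```

Honolulu is flagged by the box even though it is a valid US location, which is why points in Hawaii and Puerto Rico have to be handled separately.&lt;br /&gt;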
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
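Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate: firmbaseinclude below holds the greater (MAX) foundingdate of each duplicated firmname, and those rows are dropped when firmbasecore is rebuilt.&lt;br /&gt;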
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
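A more direct way to express "keep the earliest foundingdate per firmname" is sketched below with Python's sqlite3 module and made-up rows. This is an alternative formulation, not the method used above: it selects the MIN(foundingdate) row instead of dropping the MAX(foundingdate) row, and the two agree only when each duplicated firmname has exactly two rows with different dates. The helper name dedupe_firms is hypothetical.&lt;br /&gt;

```python
import sqlite3

# Made-up sample rows (firmname, statecode, foundingdate); not vcdb2 data.
SAMPLE_FIRMS = [
    ("Alpha Partners", "CA", "1995-01-01"),
    ("Alpha Partners", "NY", "2004-06-01"),  # duplicate name, later date
    ("Gamma Capital", "TX", "1988-09-30"),
]


def dedupe_firms(rows):
    """Return one row per firmname, keeping the earliest foundingdate."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE firmbase1 (firmname TEXT, statecode TEXT, foundingdate TEXT)"
    )
    con.executemany("INSERT INTO firmbase1 VALUES (?, ?, ?)", rows)
    # ISO formatted date strings sort chronologically, so MIN works on text here.
    core = con.execute(
        "SELECT f.firmname, f.statecode, f.foundingdate FROM firmbase1 AS f "
        "INNER JOIN (SELECT firmname, MIN(foundingdate) AS foundingdate "
        "FROM firmbase1 GROUP BY firmname) AS k "
        "ON k.firmname = f.firmname AND k.foundingdate = f.foundingdate "
        "ORDER BY f.firmname"
    ).fetchall()
    con.close()
    return core


firmbasecore = dedupe_firms(SAMPLE_FIRMS)
```

One caveat of this formulation: a firm whose foundingdate is null in every row does not survive the join, so such rows would need separate handling.&lt;br /&gt;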
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that (fundname, firstinvdate) is a valid key. However, we are going to use fundname alone as the key because it makes later join operations easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the firm keys and the fund keys, including the firmname column from the fundbasecore table with the fund keys. Run these two files through the Matcher. Then manually flag the multiple matches (there are only ~50 of them) and reimport the result into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
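The "look at the ones that don't match" query is an anti-join. The sketch below reproduces it on made-up fund names with Python's sqlite3 module; the helper name unmatched_fundnames and the sample names are hypothetical.&lt;br /&gt;

```python
import sqlite3


def unmatched_fundnames(roundline_names, fundbase_names):
    """Return distinct roundline fund names that have no exact match in fundbasecore."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE roundline1 (fundname TEXT)")
    con.execute("CREATE TABLE fundbasecore (fundname TEXT)")
    con.executemany(
        "INSERT INTO roundline1 VALUES (?)", [(name,) for name in roundline_names]
    )
    con.executemany(
        "INSERT INTO fundbasecore VALUES (?)", [(name,) for name in fundbase_names]
    )
    # LEFT JOIN keeps every roundline name; unmatched ones have a null right side.
    rows = con.execute(
        "SELECT a.fundname FROM (SELECT DISTINCT fundname FROM roundline1) AS a "
        "LEFT JOIN (SELECT DISTINCT fundname FROM fundbasecore) AS b "
        "ON a.fundname = b.fundname "
        "WHERE b.fundname IS NULL ORDER BY a.fundname"
    ).fetchall()
    con.close()
    return [row[0] for row in rows]


# Made-up names: the second differs only in punctuation, so it does not match.
missing = unmatched_fundnames(
    ["Alpha Fund I", "Alpha Fund I", "Beta Fund L.P."],
    ["Alpha Fund I", "Beta Fund LP"],
)
```

Small differences in punctuation or spelling are enough to break an exact match, which is consistent with fund names having been altered upstream.&lt;br /&gt;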
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
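Portcoexitmaster contains the portcokey (coname, statecode, datefirstinv) with an ipoflag, maflag and exitflag and an exit value (exitvaluem). When a company has an IPO issue date, the IPO amount is used as the exit value; otherwise the M and A transaction amount is used. It is built off the companybaseipomasmaster table, so be sure you have built that first.&lt;br /&gt;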
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
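The flag and exit value rules can be summarised in a few lines of Python. This sketch is an illustration only; the function name exit_flags is hypothetical and the amounts in the examples are made up. It follows the CASE expressions above, including the precedence of the IPO amount when both dates are present.&lt;br /&gt;

```python
def exit_flags(ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk):
    """Return ipoflag, maflag, exitflag and exitvaluem for one company row.

    None stands for a SQL null. Amounts ending in k are divided by 1000 to
    give exitvaluem, as in the portcoexitmaster query.
    """
    ipoflag = int(ipoissuedate is not None)
    maflag = int(masannounceddate is not None)
    exitflag = int(ipoflag == 1 or maflag == 1)
    # The IPO amount takes precedence whenever an IPO issue date exists.
    amountk = ipoprincipalamtk if ipoflag == 1 else mastransactionamtk
    exitvaluem = None if amountk is None else amountk / 1000
    return {
        "ipoflag": ipoflag,
        "maflag": maflag,
        "exitflag": exitflag,
        "exitvaluem": exitvaluem,
    }


# Made-up rows, for illustration only.
ipo_only = exit_flags("2000-01-01", None, 50000.0, None)
both = exit_flags("2000-01-01", "2005-01-01", 50000.0, 80000.0)
no_exit = exit_flags(None, None, None, None)
```

A company with no IPO and no acquisition gets exitflag 0 and a null exitvaluem.&lt;br /&gt;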
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull in SDC to get the correct city and addresses, we need to update the companybasecore table, which means we need to clean the new companybase. We then recreate the roundplus and roundlevel outputs from it.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19783</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19783"/>
		<updated>2017-08-04T21:30:07Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* File locations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
The file containing all the SQL queries used to build vcdb2 is located in the Z drive and named ProcessData2.sql. &lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the line count in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
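As an alternative to exporting the table to a text editor, the rows behind the duplicate key could be listed directly in psql. This is only a sketch; it assumes the key values shown in the output above and reads from companybase1, which is not changed by the DELETE:&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;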
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
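The 1888 figure mentioned above can be checked from the ipoduplicates table itself. A small sketch, relying on PostgreSQL naming the unaliased COUNT(*) column count; the alias duplicaterows is only illustrative:&lt;br /&gt;
 SELECT SUM(count) AS duplicaterows&lt;br /&gt;
 FROM ipoduplicates;&lt;br /&gt;
If the counts above hold, this should agree with 10440 total rows minus the 8552 keys (9491 - 939) that appear only once.&lt;br /&gt;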
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
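To compare the two numbers side by side, the total row count and the distinct key count can be returned by one query. This is a sketch; the column aliases totalrows and distinctkeys are only illustrative:&lt;br /&gt;
 SELECT (SELECT COUNT(*) FROM ipocore) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a) AS distinctkeys;&lt;br /&gt;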
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
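Rather than typing each statement by hand, DELETE statements like the ones above could be generated from the ipoduplicates2 table. This is a sketch of one possible approach, not necessarily how the list above was produced. The format() function with %L quotes each value as a SQL literal; review the generated statements before running them:&lt;br /&gt;
 SELECT DISTINCT format('DELETE FROM ipoinclude WHERE issuer = %L AND statecode = %L;', issuer, statecode)&lt;br /&gt;
 FROM ipoduplicates2;&lt;br /&gt;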
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column to mas and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
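 FROM mas --read ids from mas itself; the SERIAL ids in mas1 are generated separately and are not guaranteed to line up with those in mas&lt;br /&gt;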
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore should match the total row count of the table (114825), in which case the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher on them will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but no more than 1 week after it, the dateflag is 0.&lt;br /&gt;
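The flag logic can be tried on a few made-up dates without touching any table. In this sketch the dates are hypothetical and 2010-01-01 stands in for the minimum announceddate of one targetname, targetstatecode pair:&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-20')) AS t(d)&lt;br /&gt;
 ORDER BY d;&lt;br /&gt;
 --2010-01-01 gets 1 (it is the minimum), 2010-01-05 and 2010-01-08 get 0 (within 7 days), 2010-01-20 gets 1&lt;br /&gt;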
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
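Because the flags are added by hand in Excel, it may be worth checking them against the dates after the import. The sketch below assumes that excludeflag1 marks announceddate &amp;lt; datefirstinv and that excludeflag2 marks announceddate = datefirstinv (the page does not state which flag is which). A result of 0 would mean the imported flags agree with the data:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 != (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
  OR excludeflag2 != (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
  OR warningflag != (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;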
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns only the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is strictly after the datefirstinv (announceddate &amp;gt; datefirstinv).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join should equal the row count of the matcherwarningmindates table (364), but it is 366. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might have been miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
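The same comparison could be computed in a single query instead of adding the counts by hand. This is a sketch using the table names from this page; the aliases keysplusoverlap and matchertotal are only illustrative, and the two columns should be equal if the keys were split correctly:&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keysplusoverlap,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;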
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the corresponding dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
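One further check, sketched here as a suggestion rather than as part of the original process, is to confirm that the joins did not duplicate any portco key in the master table. The aliases totalrows and distinctkeys are only illustrative; both values should equal the companybasecore row count if no join fanned out:&lt;br /&gt;
 SELECT (SELECT COUNT(*) FROM companybaseipomasmaster) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a) AS distinctkeys;&lt;br /&gt;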
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice some errors in the matched output file. In Excel, add two exclude flags: one that is 1 where issuedate &amp;lt; datefirstinv and one that is 1 where issuedate = datefirstinv. A row with either exclude flag set to 1 should be dropped from your table, since a valid match has issuedate &amp;gt; datefirstinv; a row with both flags set to 0 should be kept. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
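The Excel flags can be cross-checked inside the database once the file is imported. The sketch below recomputes them from the imported columns. It assumes excludeflag1 marks issuedate &amp;lt; datefirstinv and excludeflag2 marks issuedate = datefirstinv (this page does not say which flag is which), and the _check column names are made up for illustration. Under that assumption, a result of 0 means the Excel flags agree with the recomputed ones.&lt;br /&gt;
 --recompute the flags in SQL; the _check names are hypothetical&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT excludeflag1, excludeflag2, warningflag,&lt;br /&gt;
   CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END AS excludeflag1_check,&lt;br /&gt;
   CASE WHEN file2issuedate = file1datefirstinv THEN 1 ELSE 0 END AS excludeflag2_check,&lt;br /&gt;
   CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END AS warningflag_check&lt;br /&gt;
  FROM matcherportcoipo)a&lt;br /&gt;
 WHERE excludeflag1 &amp;lt;&amp;gt; excludeflag1_check OR excludeflag2 &amp;lt;&amp;gt; excludeflag2_check OR warningflag &amp;lt;&amp;gt; warningflag_check;&lt;br /&gt;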
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the resolved multiple matches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
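The full rows behind the remaining duplicate keys can be listed with a query along these lines. It is a sketch that only uses the geo1 key columns named above and returns every column, so the coordinates and end years can be compared side by side.&lt;br /&gt;
 --list every column of the rows whose key still occurs more than once&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
  GROUP BY city, coname, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;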
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
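To see where the extra rows come from, look for portco keys that appear more than once in the matcher output. Because the geocore keys were verified to be unique above, the second join cannot multiply rows, so any growth in the line count has to come from repeated portco keys in matcherportcogeo. A minimal diagnostic sketch:&lt;br /&gt;
 --portco keys that the Matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;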
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions are on the [[Geocode.py]] page. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates in the db are incorrect; I found this while analyzing the output data and traced it back to a dirty file called GeocodedVCData that we initially used for geo coordinates. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
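A sketch of the address export might look like the following. The table and file name addressesforgeocode are hypothetical, and the column layout that Geocode.py expects is not documented on this page, so reorder the columns as the script requires. The companybase key (coname, statecode, datefirstinv) is exported alongside the address so the coordinates can be joined back.&lt;br /&gt;
 --hypothetical table name; keeps the portco key next to each address&lt;br /&gt;
 CREATE TABLE addressesforgeocode AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL;&lt;br /&gt;
 \COPY addressesforgeocode TO 'addressesforgeocode.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;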
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
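The tempbase query reads from an allyears table that this page never builds. A minimal sketch, assuming allyears is simply a one-column table of calendar years that spans the data; the year range is derived from deadalive1 here, which is my assumption and not necessarily how the original table was made. It has to be run after deadalive1 and before tempbase.&lt;br /&gt;
 --one row per calendar year between the earliest first investment and the latest deadyear&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(&lt;br /&gt;
  (SELECT MIN(EXTRACT(YEAR FROM datefirstinv))::int FROM deadalive1),&lt;br /&gt;
  (SELECT MAX(deadyear)::int FROM deadalive1)&lt;br /&gt;
 ) AS year;&lt;br /&gt;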
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
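No uniqueness check is recorded for roundcore, so here is one in the same pattern used for the other core tables. If the exclusion worked, this count matches the row count of roundcore.&lt;br /&gt;
 --should equal SELECT COUNT(*) FROM roundcore&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;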
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks financing for activities we are not interested in, so that those companies can be excluded from our dataset; entries like 'Open Market Purchase', 'PIPE', etc. are what the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
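Since the flags are driven by exact string matches on stage3, it is worth checking which stage3 values fall through all of the CASE expressions. The query below is a sketch of that check; any value it returns (including NULL) is not covered by the growth, transaction or exclude flag.&lt;br /&gt;
 --stage3 values that no flag picks up&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;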
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the results back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
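One caveat on the geoimportclean step: NOT IN returns no rows at all if the subquery yields a NULL, so a single NULL coname in geodirtyfix would empty the table. A NOT EXISTS anti-join avoids that. The sketch below uses the same name-only match and gives the same result when neither table has a NULL coname; it is not the query that produced the recorded counts.&lt;br /&gt;
 --anti-join by coname; unaffected by NULLs in geodirtyfix.coname&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE NOT EXISTS (SELECT 1 FROM geodirtyfix AS f WHERE f.coname = g.coname);&lt;br /&gt;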
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this we will look up all the addresses in the DB using the Geocode.py script. We also need to pull a company level file from SDC, because the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates so that each company gets exactly one line; this avoids normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Run this through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added flags on the geodata table to filter points outside the US. You can use the geoallcoords1 table instead of geoallcoords and keep only rows where excludeflag = 0 when you create your colevel tables; in my run this filtered out 292 erroneous points.&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 --IS NULL is required; = NULL never matches, so the 292 count below predates flagging NULL coordinates&lt;br /&gt;
 WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
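Instead, we chose to use only firmname as the key because there were few duplicates (12 names). For each duplicated name we keep the row with the lesser foundingdate: firmbaseinclude, despite its name, collects the MAX(foundingdate) row for each duplicated name, and the anti-join that builds firmbasecore drops those rows.&lt;br /&gt;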
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to find all the duplicated fundnames and, for each one, keep in the fundbasecore table only the row that has both the MIN(closedate) and the MIN(lastinvdate).&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
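Two optional checks may help here. Both are sketches that assume the tables above exist as built, and neither was part of the original build. The first lists the duplicated fundnames that did not make it into fundinclude2, which can happen when no single row carries both minimum dates or when a closedate or lastinvdate is NULL (a NULL never satisfies the equality join). The second shows whether any fundname is still duplicated in fundbasecore.&lt;br /&gt;
 --duplicated fundnames that did not survive into fundinclude2&lt;br /&gt;
 SELECT fi.fundname&lt;br /&gt;
 FROM fundinclude AS fi&lt;br /&gt;
 LEFT JOIN fundinclude2 AS f2 ON fi.fundname = f2.fundname&lt;br /&gt;
 WHERE f2.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 --any fundname returned here is still duplicated in fundbasecore&lt;br /&gt;
 SELECT fundname, COUNT(*)&lt;br /&gt;
 FROM fundbasecore&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;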
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher, then manually flag the multiple matches (there are only ~50 of them) and reimport the result into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
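A minimal sketch of how those firms could be listed, using an anti-join in the same style as the fund/roundline check further down (the query is illustrative and was not part of the original build):&lt;br /&gt;
 --firmnames that appear in fundbasecore but not in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;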
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to correct the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
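To see which portco keys drop out, a sketch along these lines could be used (illustrative only; it assumes companybasecore and the table above exist as built):&lt;br /&gt;
 --portco keys in companybasecore with no disclosed fund and firm attached&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;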
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Because we did another pull from SDC to get the correct cities and addresses, we need to update the companybasecore table, which means we need to clean the new companybase. We then recreate the roundplus and roundlevel outputs.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19781</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19781"/>
		<updated>2017-08-04T21:28:37Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Fixing erroneous geo-coordinates */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of rows copied into each new table matches the count noted in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean, we rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match: there are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look at it in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
Great! The counts don't match so we'll have to clean the mas table. There is no obvious field to filter against with mas. So I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
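 --use mas1 here: masinclude.id comes from mas1, and the ids added to mas separately are not guaranteed to line up&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;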
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
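As an aside, PostgreSQL's DISTINCT ON can express the same "keep the row with the lowest id per key" step in one statement. This is a sketch only, not what was run for vcdb2; the table name mascore_alt is hypothetical, and it relies on mas1 having the id column added above.&lt;br /&gt;
 --hypothetical alternative: one row per key, lowest id wins&lt;br /&gt;
 CREATE TABLE mascore_alt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;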
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we filter the mas keys further. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 7 days after the minimum announceddate for its targetname, targetstatecode pair, or when it is the minimum announceddate for that pair. If the announceddate falls within 7 days after the minimum announceddate for the targetname, targetstatecode pair, the dateflag is 0.&lt;br /&gt;
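To see how many keys the window removes, you can count the rows for each flag value. This is only a quick check on the maskeysdatewindow table built above; the two counts should sum to the 114825 rows in maskeys. Note that a key with a NULL targetstatecode never satisfies the join condition, so its minanndate is NULL and it ends up in the 0 group.&lt;br /&gt;
 --rows kept (1) and rows discarded (0) by the one week window&lt;br /&gt;
 SELECT dateflag, COUNT(*)&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 GROUP BY dateflag&lt;br /&gt;
 ORDER BY dateflag;&lt;br /&gt;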
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag that is 1 when the announceddate &amp;lt; datefirstinv and a second exclude flag that is 1 when the datefirstinv = announceddate. Also add a warning flag that is 1 when the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table:&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
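The three flags can also be computed in SQL instead of Excel. The sketch below is an alternative, not the process used here. It assumes the unflagged matcher output was first loaded into a hypothetical table matcherportcomasraw holding the first seven columns above, that excludeflag1 marks announceddate &amp;lt; datefirstinv and that excludeflag2 marks equal dates. The output table name matcherportcomasflagged is also hypothetical.&lt;br /&gt;
 --hypothetical staging table for the raw matcher output (no flag columns)&lt;br /&gt;
 CREATE TABLE matcherportcomasraw (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar(2),&lt;br /&gt;
  file2announceddate date&lt;br /&gt;
 );&lt;br /&gt;
 &lt;br /&gt;
 --add the flags; a NULL warning or NULL date falls through to 0&lt;br /&gt;
 CREATE TABLE matcherportcomasflagged AS&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2announceddate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2announceddate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw AS r;&lt;br /&gt;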
You've imported 9,645 matches into your matcher table in vcdb2. The query below counts the &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the datefirstinv &amp;lt; announceddate for a merger/acquisition (the first investment precedes the exit) and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
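The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. That means some portco keys match more than one mas key on the same minimum announceddate. To find the dirty entries we'll use the query below.&lt;br /&gt;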
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
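As an alternative to the manual deletes, PostgreSQL's DISTINCT ON can keep exactly one row per portco key in a single pass. This is a sketch only. It breaks ties on the minimum announceddate by taking the alphabetically first file2targetname, which is an arbitrary rule and may not pick the row you would choose by hand. The table name matcherportcomasinclude_alt is hypothetical.&lt;br /&gt;
 --one row per portco key: earliest announceddate, ties broken by target name (arbitrary)&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude_alt AS&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2announceddate, file2targetname;&lt;br /&gt;
The row count of this table should equal the row count of matcherwarningmindates, since both contain one row per portco key.&lt;br /&gt;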
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. Each portco key from the companybase table should match exactly one mas key from the mascore table. If any portco key matches more than one mas key, you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. Each should be 1:1 on itself, meaning one key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because the maskeys and ipokeys will collide. Some companies will have both an ipo and a merger/acquisition, or the data might be miscoded by SDC Platinum. A company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open the file in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
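In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression above is equivalent to a plain LEAST(announceddate, ipoissuedate). A quick way to confirm this on the table just built is the check below, which should return 0.&lt;br /&gt;
 --count rows where masterdate differs from the plain LEAST result (NULL-safe comparison)&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate&lt;br /&gt;
 WHERE masterdate IS DISTINCT FROM LEAST(announceddate, ipoissuedate);&lt;br /&gt;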
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
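If a portco has an ipo and a merger/acquisition on the same date, both flags are set to 1 and the key would land in both core tables. The check below is a small sketch for spotting such ties. A result of 0 means every portco has at most one valid exit key.&lt;br /&gt;
 --portcokeys where the ipo and mas dates tie&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;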
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the sum to the combined row counts of your matcherportcoipocore and matcherportcomascore tables. The 83 portcokeys that matched both an ipokey and a maskey each kept only one of the two, which is why they are added back. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
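One more check that fits here is to confirm the joins did not duplicate any portco key. The sketch below uses the same distinct key pattern as the earlier sections. The result should equal the 44740 rows in the master table.&lt;br /&gt;
 --distinct portco keys in the master table&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;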
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add one exclude flag that is 1 where the issuedate &amp;lt; datefirstinv and a second that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 8 keys that are not distinct, accounting for 11 surplus rows (43662 rows versus 43651 distinct keys). We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
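To compare the conflicting rows side by side, you can pull the full rows for every key that still appears more than once. This is a minimal sketch that reuses the duplicate key query above as a subquery.&lt;br /&gt;
 --full geo1 rows for keys that occur more than once&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear&lt;br /&gt;
       FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;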
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
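The extra rows appear when one portco key in matcherportcogeo is paired with more than one geo key, since geocore itself is distinct on its key. A sketch for listing the offending portco keys:&lt;br /&gt;
 --portco keys that the matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;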
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates in the current data are incorrect. This was found while analyzing the output data and traced back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
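A minimal sketch of exporting the addresses together with the portco key is shown below. The output file name portcoaddresses.txt is hypothetical, and the column layout is an assumption, so check what input format Geocode.py expects before using it.&lt;br /&gt;
 --export the portco key plus address fields (file name and column order are assumptions)&lt;br /&gt;
 \COPY (SELECT coname, statecode, datefirstinv, addr1, addr2, city FROM companybasecore) TO 'portcoaddresses.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;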
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 --city is carried along because the deadalive1 query below selects it&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN (datelastinv + INTERVAL '5 YEAR')::date&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 AS&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
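To confirm that coname, rounddate is now a valid key, the count comparison used for the other core tables can be applied to roundcore. This is a sketch, not a query from the original build; the two counts should match. Note that the NOT EXISTS filter above cannot match rows whose rounddate is NULL, so duplicates with a missing rounddate would show up here as a difference.&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;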
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that those rounds can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
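As a sanity check (a sketch, not part of the original build), every round should fall into exactly one of the growth, transaction or exclude groups. The query below lists any stage3 value that is unclassified or classified twice, including rounds with a NULL stage3. It should return no rows if the CASE lists above cover every stage.&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag + transactionflag + excludeflag != 1&lt;br /&gt;
 GROUP BY stage3;&lt;br /&gt;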
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box, then add back the points that are located in Hawaii and Puerto Rico. Finally, import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this, we look up all the addresses in the db using the Geocode.py script. We also need to pull a company level file from SDC, because in the round level pull the addresses are copied down or set to null by the normalizer. Modify your round ssh sdc script to remove the round dates, so that each company gets exactly one line and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Run this through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
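The export step might look like the following. This is a sketch under assumptions: it assumes the company level pull has been loaded as sdccompanybase (the table cleaned later on this page), and the table name sdccompanygeolookup and its output file name are placeholders chosen to line up with the _coords file loaded below.&lt;br /&gt;
 --hypothetical export of the address lookup file&lt;br /&gt;
 CREATE TABLE sdccompanygeolookup AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM sdccompanybase&lt;br /&gt;
 WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL;&lt;br /&gt;
 \COPY sdccompanygeolookup TO 'sdccompanygeolookup.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
The geocoded output is then loaded back into the db:&lt;br /&gt;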
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
I also added a flag on the geodata table to filter out points outside the US. You can use the geoallcoords1 table instead of geoallcoords and keep only the rows with excludeflag = 0 when you create your colevel tables. This drops the 292 erroneous points, which have excludeflag = 1.&lt;br /&gt;
 CREATE TABLE geoallcoords1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 OR latitude &amp;lt; 24 OR latitude &amp;gt; 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE &lt;br /&gt;
 0::int END AS excludeflag&lt;br /&gt;
 FROM geoallcoords;&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM geoallcoords1 WHERE excludeflag = 1;&lt;br /&gt;
 --292&lt;br /&gt;
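The continental box also flags valid points in Hawaii and Puerto Rico. One way to add them back is sketched below. The table name geoallcoords2, the hawaiiflag and prflag columns, and the two extra boxes (roughly latitude 18 to 23 and longitude -161 to -154 for Hawaii, latitude 17.5 to 19 and longitude -68 to -65 for Puerto Rico) are assumptions for illustration, not values from the original build.&lt;br /&gt;
 --hypothetical table; the Hawaii and Puerto Rico boxes are rough assumptions&lt;br /&gt;
 CREATE TABLE geoallcoords2 AS&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
 CASE WHEN latitude BETWEEN 18 AND 23 AND longitude BETWEEN -161 AND -154 THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 CASE WHEN latitude BETWEEN 17.5 AND 19 AND longitude BETWEEN -68 AND -65 THEN 1::int ELSE 0::int END AS prflag,&lt;br /&gt;
 CASE WHEN latitude IS NULL OR longitude IS NULL THEN 1::int&lt;br /&gt;
  WHEN latitude BETWEEN 24 AND 50 AND longitude BETWEEN -125 AND -66 THEN 0::int&lt;br /&gt;
  WHEN latitude BETWEEN 18 AND 23 AND longitude BETWEEN -161 AND -154 THEN 0::int&lt;br /&gt;
  WHEN latitude BETWEEN 17.5 AND 19 AND longitude BETWEEN -68 AND -65 THEN 0::int&lt;br /&gt;
  ELSE 1::int END AS excludeflag&lt;br /&gt;
 FROM geoallcoords AS g;&lt;br /&gt;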
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
Instead, we chose to use only firmname as the key because there were few duplicates. For each duplicated firmname we drop the row with the latest foundingdate, which keeps the lesser foundingdate.&lt;br /&gt;
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. However, we are going to use fundname alone as the key because it makes the later join operations easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
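The join in fundinclude2 requires a single row to carry both the minimum closedate and the minimum lastinvdate, and it cannot match NULL dates. Some duplicated names may therefore drop out entirely, while tied rows may be kept more than once. A quick check (a sketch, not part of the original build) lists any fundname that still appears more than once in fundbasecore:&lt;br /&gt;
 SELECT fundname, COUNT(*)&lt;br /&gt;
 FROM fundbasecore&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;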
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and the fund keys, and include the firmname from the fundbasecore table with the fund keys. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Finally, reimport the result to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
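One way to list those firms is an anti-join from the funds side. This is a sketch using the same tables as the test join above:&lt;br /&gt;
 --firms named in fundbasecore that have no row in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;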
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
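A quick cross-tabulation of the flags (a sketch, not part of the original build) shows how many portcos have an ipo, an ma, both or neither. Where both dates are present, the CASE above takes exitvaluem from the ipo amount.&lt;br /&gt;
 SELECT ipoflag, maflag, COUNT(*)&lt;br /&gt;
 FROM portcoexitmaster&lt;br /&gt;
 GROUP BY ipoflag, maflag&lt;br /&gt;
 ORDER BY ipoflag, maflag;&lt;br /&gt;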
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
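To see which portcokeys are lost in the join, an anti-join against companybasecore can be used. This is a sketch built only from the tables above:&lt;br /&gt;
 --portcokeys in companybasecore with no matching fund and firm&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN fundbasefirmbaseroundlinegoodkeys AS g ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;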
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull in SDC to get the correct city and addresses, we need to update the companybasecore table, which means cleaning the new companybase first. We then recreate roundplus and the round level outputs from it.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19780</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19780"/>
		<updated>2017-08-04T21:25:39Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Loading starting data into database */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder you start psql from. Check that the row counts copied into your new tables match the line counts of the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what needs cleaning, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise anything else you build off roundbase will carry the dirty keys into other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is good to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
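On a first run the DROP TABLE statements above raise an error because the tables do not exist yet. A minimal sketch of a more forgiving variant, assuming you are happy for a missing table to be skipped, uses PostgreSQL's IF EXISTS clause:&lt;br /&gt;
 DROP TABLE IF EXISTS companybase;&lt;br /&gt;
 DROP TABLE IF EXISTS round;&lt;br /&gt;
The same form can be used for any other DROP TABLE on this page.&lt;br /&gt;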
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means there will be a 1:1 relationship between keys and entries: given an entry you can build a unique key, and given a (coname, statecode, datefirstinv) key you can find exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
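The two counts can also be pulled in one statement so that they sit side by side. This is only a sketch; the column aliases totalrows and distinctkeys are made up for illustration:&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybase) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a) AS distinctkeys;&lt;br /&gt;
The key is unique only when the two columns are equal.&lt;br /&gt;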
Some of the rows in the companybase table have undisclosed company names or belong to companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something that distinguishes the two New York Digital Health LLC entries, so that one of them can be deleted manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. After that we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
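As an alternative to exporting the table, the colliding rows can be compared directly in psql. The sketch below picks a few columns that seem likely to differ; the choice of columns is an assumption, and any column of companybase1 can be added:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, url, updateddate, addr1&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;
Run this before the DELETE to see both rows and confirm which url value to remove.&lt;br /&gt;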
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and undisclosed company that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between the key (coname, statecode, datefirstinv) and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
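Once the counts agree, the database itself can guard the key. The sketch below is an optional extra and not part of the original process; the index name companybasecore_key_idx is hypothetical:&lt;br /&gt;
 CREATE UNIQUE INDEX companybasecore_key_idx ON companybasecore (coname, statecode, datefirstinv);&lt;br /&gt;
The statement fails if a duplicate key is still present. By default a unique index does not treat rows with a NULL in a key column as duplicates of each other, so the COUNT checks above are still worth running.&lt;br /&gt;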
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which add up to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicated keys have rows with different principalamts. Let's keep the row with the MAX principalamt for each key and discard the rows with the same key and a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
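These leftover duplicates are most likely keys where two or more rows share the same maximum principalamt, so the join on principalamt returns all of them. Under that assumption, a sketch that lists the tied keys and how many rows tie (tiedrows is a made-up alias) is:&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode, ipos.principalamt, COUNT(*) AS tiedrows&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND&lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt&lt;br /&gt;
 GROUP BY ipos.issuer, ipos.issuedate, ipos.statecode, ipos.principalamt&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;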
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
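Writing one DELETE per issuer by hand is error-prone. A set-based sketch that removes every key listed in ipoduplicates2 in a single statement is shown below. Note that it matches on the full key including issuedate, whereas the hand-written statements above match on issuer and statecode only, so the number of deleted rows may differ from the counts recorded above:&lt;br /&gt;
 DELETE FROM ipoinclude&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE ipoinclude.issuer = d.issuer AND ipoinclude.issuedate = d.issuedate AND ipoinclude.statecode = d.statecode;&lt;br /&gt;
Keys with a NULL statecode would not be matched by this comparison and would need separate handling.&lt;br /&gt;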
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and took the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
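The same deduplication can be sketched without a separate include table by numbering the rows inside each key with a window function and keeping the first one. This is an alternative illustration, not the method used to build vcdb2; the table name mascoresketch and the column rn are hypothetical:&lt;br /&gt;
 CREATE TABLE mascoresketch AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT mas1.*, ROW_NUMBER() OVER (PARTITION BY targetname, targetstatecode, announceddate ORDER BY id) AS rn FROM mas1)a&lt;br /&gt;
 WHERE rn = 1;&lt;br /&gt;
Ordering by id keeps the row with the lowest id in each key, which mirrors the MIN(id) used for masinclude. The extra rn column can be dropped afterwards.&lt;br /&gt;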
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.&lt;br /&gt;
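To see the flag in action, here is a small self-contained sketch that applies the same CASE expression to four made-up dates against a made-up minimum of 2010-01-01. It needs no tables from vcdb2:&lt;br /&gt;
 SELECT d AS announceddate, DATE '2010-01-01' AS minanndate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d);&lt;br /&gt;
The dateflag values come out as 1, 0, 0, 1: the minimum date itself is kept, dates up to and including 7 days later are dropped, and 2010-01-09 is kept as a separate event.&lt;br /&gt;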
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where announceddate &amp;lt; datefirstinv and another exclude flag for rows where datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
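The three flags can also be computed in SQL instead of Excel. The sketch below assumes the raw matcher output has been loaded, without the flag columns, into a hypothetical table named matcherportcomasraw whose seven columns match the first seven columns of matcherportcomas:&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw AS r;&lt;br /&gt;
Doing this in the database keeps the flag definitions next to the queries that use them.&lt;br /&gt;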
You've imported 9,645 matches into your matcher table in vcdb2. The query below counts the &amp;quot;good&amp;quot; matches: those that carry no warning, where the announceddate of the merger/acquisition is not before the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0, creating the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
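The check above only covers the portco side of the match. A sketch of the mirror-image check, which lists any mas key that is matched to more than one portco key, is:&lt;br /&gt;
 SELECT file2targetname, file2targetstatecode, file2announceddate, COUNT(*)&lt;br /&gt;
 FROM matcherportcomascore&lt;br /&gt;
 GROUP BY file2targetname, file2targetstatecode, file2announceddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
An empty result means the match is 1:1 in both directions.&lt;br /&gt;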
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. Each should be 1:1 on itself, meaning 1 key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our row counts to increase every time we join on these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
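If a portco's announceddate and ipoissuedate fall on the same day, both flags above are set to 1 and the portco would end up in both key tables. A quick sketch to count such ties:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;
A count of 0 means every portco keeps at most one exit key.&lt;br /&gt;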
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
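The comparison can also be done in one query. This sketch assumes the matcher core tables are named matcherportcomascore and matcherportcoipocore, as in the join earlier in this section; keptplusdropped and matched are made-up aliases:&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusdropped,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matched;&lt;br /&gt;
With the counts quoted above, both columns come to 11005.&lt;br /&gt;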
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the corresponding dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will multiply rows across the joins and make this master table a complete mess, so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
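As a further check, here is a minimal sketch (not one of the original queries) that looks for duplicate company keys in the master table directly. It assumes the key of the master table is (coname, statecode, datefirstinv), as in companybasecore; numrows is just an illustrative column name. If the joins did not multiply any companies, the query should return no rows.&lt;br /&gt;
 --sketch: list any company key that appears more than once in the master table&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*) AS numrows&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;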
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add two exclude flags: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you will want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
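Since the flags are built by hand in Excel, it can be worth confirming inside the db that they agree with the dates. The query below is only a sketch: it assumes that excludeflag1 marks issuedate &amp;lt; datefirstinv and excludeflag2 marks issuedate = datefirstinv, which the column names alone do not confirm. A count of 0 would mean the imported flags match the dates and the warning column. Rows with a NULL flag are not counted.&lt;br /&gt;
 --sketch: count rows whose imported flags disagree with the dates or the warning column&lt;br /&gt;
 --assumes excludeflag1 = issuedate before datefirstinv and excludeflag2 = issuedate equal to datefirstinv&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 &amp;lt;&amp;gt; (CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 &amp;lt;&amp;gt; (CASE WHEN file2issuedate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag &amp;lt;&amp;gt; (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;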
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates, so we'll just select the minimum issuedate for entries where the portcokey is matched more than once.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows vs. 43651 keys), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one row for each duplicated key. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table to build a matching key. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the Matcher you will notice there are a ton of matching errors, as usual. If you simply import this into vcdb2 and try joining companybasecore with geocore, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018 in the join below.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
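One likely source of the extra rows is a portco key that the Matcher paired with more than one geo key, since each extra pairing adds a row to the left join. The sketch below (not one of the original queries; nummatches is an illustrative column name) lists those keys. Note that (coname, city, year) is not guaranteed to be unique in companybasecore either, which could add rows in the same way.&lt;br /&gt;
 --sketch: portco keys that appear more than once in the Matcher output&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;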
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
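First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: if the company had an ipo the deaddate is the issuedate, otherwise if it had an ma the deaddate is the announceddate, and if there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;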
 CREATE TABLE deaddate AS&lt;br /&gt;
 --city is carried through because deadalive1 below selects it&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
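To see how the deaddate rule splits the companies, a sketch like the one below counts the rows that fall into each branch of the CASE. The output column names (numipo, numma, numnoexit, numnodeaddate) are illustrative only. The first three counts should add up to the row count of deaddate1, and numnodeaddate shows how many companies have neither an exit nor a datelastinv.&lt;br /&gt;
 --sketch: how many companies fall into each branch of the deaddate CASE&lt;br /&gt;
 SELECT&lt;br /&gt;
 SUM(CASE WHEN issuedate IS NOT NULL THEN 1 ELSE 0 END) AS numipo,&lt;br /&gt;
 SUM(CASE WHEN issuedate IS NULL AND announceddate IS NOT NULL THEN 1 ELSE 0 END) AS numma,&lt;br /&gt;
 SUM(CASE WHEN issuedate IS NULL AND announceddate IS NULL THEN 1 ELSE 0 END) AS numnoexit,&lt;br /&gt;
 SUM(CASE WHEN deaddate IS NULL THEN 1 ELSE 0 END) AS numnodeaddate&lt;br /&gt;
 FROM deaddate1;&lt;br /&gt;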
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to filter out financing for activities we are not interested in, which should be excluded from our dataset; entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
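The flag file imported below was built outside the database. As a rough sketch of the same bounding-box test in SQL, the query below flags coordinates that fall outside longitude -125 to -66 or latitude 24 to 50. The flag names are borrowed from the geodirtydatawithflags table, but their exact definitions in the original file are an assumption. The Hawaii and Puerto Rico add-back is not shown because the boxes used for them are not recorded here.&lt;br /&gt;
 --sketch: flag coordinates outside the continental US box (longitude -125 to -66, latitude 24 to 50)&lt;br /&gt;
 --the flag definitions are assumed, not taken from the original flag file&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 THEN 1 ELSE 0 END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude &amp;lt; 24 OR latitude &amp;gt; 50 THEN 1 ELSE 0 END AS latdirtyflag&lt;br /&gt;
 FROM geodirtydata&lt;br /&gt;
 WHERE latitude IS NOT NULL AND longitude IS NOT NULL;&lt;br /&gt;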
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
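Note that geoimportclean above filters on coname only, so a company that shares its name with a fixed company but has a different statecode or datefirstinv is dropped as well. A possible variant, shown only as a sketch, compares the full (coname, statecode, datefirstinv) key instead. Its row count may differ from the counts recorded above, and rows with a NULL statecode or datefirstinv never match under this comparison.&lt;br /&gt;
 --sketch: keep geoimport rows whose full key does not appear in geodirtyfix&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE NOT EXISTS (SELECT 1 FROM geodirtyfix AS f WHERE f.coname = g.coname AND f.statecode = g.statecode AND f.datefirstinv = g.datefirstinv);&lt;br /&gt;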
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this problem we will just look up all the addresses in the db using the Geocode.py script. We also need to pull a company level file from SDC, because otherwise the normalizer copies the addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that only one line is assigned to each company and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
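Instead we chose to use only firmname as the key because there were not too many duplicates. For each duplicated firmname we keep the row with the earlier foundingdate: firmbaseinclude (despite its name) collects the later foundingdate for each duplicate, and those rows are then excluded when building firmbasecore.&lt;br /&gt;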
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The counts match, so (fundname, firstinvdate) is a valid key. However, we are going to use fundname alone as the key because it makes the join operations later on easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
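Note that 47 duplicated fundnames went in but only 44 rows came out. The query below is a sketch (not part of the original build) that lists any fundnames that dropped out of the join. One possible cause, which is an assumption to check against the data, is a NULL closedate or lastinvdate, because NULL never satisfies an = comparison. Another is that the MIN closedate and the MIN lastinvdate come from different rows.&lt;br /&gt;
 SELECT fundname FROM fundinclude&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT fundname FROM fundinclude2;&lt;br /&gt;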
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches (there are only about 50 of them) and reimport into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull from SDC to get the correct cities and addresses, we need to update the companybasecore table, which means we need to clean the new companybase. We then recreate the roundplus and roundlevel outputs.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,&lt;br /&gt;
 indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19777</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19777"/>
		<updated>2017-08-04T21:22:10Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* File locations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
Database can be started by typing psql vcdb2&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
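As a convenience, the two counts can also be returned side by side in a single query. This is only a sketch and not part of the original build; the column aliases totalrows and distinctkeys are made up for illustration. If the two values are equal, the key is unique.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybase) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a) AS distinctkeys;&lt;br /&gt;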
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the database and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
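Before running a manual delete like the one above, it can help to preview how many rows the WHERE clause will hit. The query below is a sketch (not part of the original build) that reuses exactly the same conditions; the number it returns is the number of roundbase rows the DELETE will remove.&lt;br /&gt;
 SELECT COUNT(*) FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;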
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad or Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys appear with different principalamts. For each key, keep the row with the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I inserted an id column and took the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
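 FROM mas --use mas here, not mas1: the two SERIAL id columns are assigned separately and are not guaranteed to line up for the join on mas.id&lt;br /&gt;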
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within one week after the earliest announceddate for the same targetname, targetstatecode pair. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.&lt;br /&gt;
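As a quick sanity check of the flag logic, the sketch below applies the same CASE expression to four made-up date pairs. The dates are hypothetical and are not taken from mas.&lt;br /&gt;
 SELECT t.announceddate, t.minanndate,&lt;br /&gt;
 CASE WHEN t.announceddate - INTERVAL '7 day' &amp;gt; t.minanndate OR t.announceddate = t.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2001-03-01', DATE '2001-03-01'), --the minimum date itself: dateflag 1&lt;br /&gt;
  (DATE '2001-03-05', DATE '2001-03-01'), --4 days after the minimum: dateflag 0&lt;br /&gt;
  (DATE '2001-03-08', DATE '2001-03-01'), --exactly 7 days after the minimum: dateflag 0&lt;br /&gt;
  (DATE '2001-03-20', DATE '2001-03-01')  --19 days after the minimum: dateflag 1&lt;br /&gt;
 ) AS t(announceddate, minanndate);&lt;br /&gt;
Note that every key is compared only with the earliest announceddate for its targetname, targetstatecode pair, so two later keys that fall within a week of each other both keep dateflag 1.&lt;br /&gt;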
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where the announceddate &amp;lt; datefirstinv and a second exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
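If you want to double-check the flags you added in Excel, they can be recomputed inside the database. The query below is a minimal sketch, assuming the flag definitions described above and that the warning column holds exactly the text Hall-Warning:Multiple; the check columns are hypothetical names. It lists every row where an imported flag disagrees with the recomputed one, so an empty result means the Excel flags are consistent.&lt;br /&gt;
 SELECT * FROM&lt;br /&gt;
 (SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS checkflag1, --announceddate before datefirstinv&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS checkflag2, --announceddate equals datefirstinv&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS checkwarning --multiple-match warning&lt;br /&gt;
 FROM matcherportcomas)a&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM checkflag1 OR excludeflag2 IS DISTINCT FROM checkflag2 OR warningflag IS DISTINCT FROM checkwarning;&lt;br /&gt;
IS DISTINCT FROM is used instead of a plain inequality so that a flag left blank in Excel (imported as NULL) is also reported.&lt;br /&gt;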
You've imported 9,645 matches into your matcher table in vcdb2. The query below gives the number of &amp;quot;good&amp;quot; matches: matches that carry no warning and where the announceddate of the merger/acquisition is later than the datefirstinv (not earlier and not equal).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the matches with a multiple-match warning as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
For each portco key with a multiple-match warning, select the minimum announceddate among its matched mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
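The inner join result should equal the number of rows in the matcherwarningmindates table (364), but it returns 366. The extra rows appear when a portco key is matched to two different targets on the same minimum announceddate. To find the dirty entries we'll use the query below.&lt;br /&gt;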
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. Each should be 1:1 on itself, meaning 1 key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have an ipo as well as a merger/acquisition, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below (portcokeys matched to both an ipokey and a maskey) and compare the sum to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 11005 = 8655 + 2350.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
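The same comparison can be done in a single query. This is a sketch that only uses the tables built above; the two output column names are hypothetical. With the counts reported on this page both columns come out to 11005 (8610 + 2312 + 83 and 8655 + 2350).&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeyipokeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusboth,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;
The two columns only agree if no portco has its ipo and its ma on the same date. In that case both flags are set to 1, both keys are kept, and keptplusboth is larger than matchertotal.&lt;br /&gt;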
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the date and amount of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the key tables keep only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, you will want to exclude the row from your table. If both flags are 0 (i.e. the issuedate &amp;gt; datefirstinv), you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates, so we'll just select the minimum issuedate for entries where the portcokey is matched more than once.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see that most of them are simply complete duplicates, so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43,662 rows vs. 43,651 keys), spread over the 8 keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify the geo1 query to get rid of the Undisclosed Company rows:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import the output into vcdb2 and try joining companybasecore with geocore, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018 in the join below.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
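To see where the extra rows come from, you can count how many geo keys each portco key is matched to. This is a minimal sketch, assuming the fan-out comes from portco keys that appear more than once in matcherportcogeo; duplicate (coname, city, year) combinations in companybasecore would need a similar check on that table.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Every key returned by this query joins to more than one geo row and adds at least one extra row to companybasecorejoingeo.&lt;br /&gt;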
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
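A possible way to export the addresses together with the portco key is sketched below. The output file name is hypothetical, and the column layout that Geocode.py expects is not described on this page, so adjust the SELECT list to whatever the script needs.&lt;br /&gt;
 \COPY (SELECT coname, statecode, datefirstinv, addr1, addr2, city FROM companybasecore) TO 'portcoaddresses.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Because (coname, statecode, datefirstinv) is the companybasecore key, the geocoded output can be joined back on those three columns.&lt;br /&gt;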
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: if the company has an exit, the deaddate is the exit date (issuedate or announceddate); if there is no exit, the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in and that should therefore be dropped from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and do a lookup using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
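The box test can also be run inside the database. The following is only a minimal sketch, assuming the coordinates sit in geoimport and using the boundaries above. The table name geooutsidebox and the column outsideboxflag are hypothetical, and points in Hawaii and Puerto Rico would still need to be added back afterwards.&lt;br /&gt;
 --sketch only: geooutsidebox and outsideboxflag are hypothetical names&lt;br /&gt;
 CREATE TABLE geooutsidebox AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN latitude BETWEEN 24 AND 50 AND longitude BETWEEN -125 AND -66 THEN 0::int ELSE 1::int END AS outsideboxflag&lt;br /&gt;
 FROM geoimport&lt;br /&gt;
 WHERE latitude IS NOT NULL AND longitude IS NOT NULL;&lt;br /&gt;
Rows with outsideboxflag = 1 are the candidates to send back through Geocode.py. The steps actually used are recorded below.&lt;br /&gt;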
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this problem we will just look up all the addresses in the DB using the Geocode.py script. We also need to pull a company-level file from SDC, because otherwise the addresses will be copied down or nulled by the normalizer. Modify your round ssh sdc script to remove the round dates, so that each company gets exactly one line and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run this through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
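The copy-out step might look roughly like the sketch below. It assumes the company-level pull has been loaded into sdccompanybase, that the output file is named sdccompanygeolookup.txt (inferred from the _coords file imported below), and that Geocode.py accepts the key columns followed by the address fields. None of these is confirmed here, so adjust the column list to whatever the script expects.&lt;br /&gt;
 --sketch only: file name and column order are assumptions&lt;br /&gt;
 \COPY (SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip FROM sdccompanybase WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL) TO 'sdccompanygeolookup.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
The geocoded result is then loaded back in as follows.&lt;br /&gt;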
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
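Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the record with the lesser foundingdate: firmbaseinclude, despite its name, holds the MAX(foundingdate) record for each duplicated firmname, and those records are dropped when firmbasecore is rebuilt.&lt;br /&gt;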
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
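To see which firms those are, an anti-join along the following lines could be used. This is a sketch rather than a query from the original build, and it reuses only the table and column names already shown above.&lt;br /&gt;
 --sketch only: lists firm names that appear in fundbasecore but not in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;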
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are a lot of mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
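One way to inspect the portco keys that drop out of the join is an anti-join back to companybasecore. The query below is a sketch under that assumption and was not part of the original build.&lt;br /&gt;
 --sketch only: portco keys in companybasecore with no matching fund/firm/roundline record&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN fundbasefirmbaseroundlinegoodkeys AS g ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;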
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull in SDC to get the correct city and addresses, we need to update the companybasecore table, which means we need to clean the new companybase. We then recreate the roundplus and roundlevel outputs.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname, lastupdated, foundingdate, datelastinv, datefirstinv, investedk, city, description, msa, msacode, nationcode, statecode, addr1, addr2, &lt;br /&gt;
 indclass, indsubgroup, indminorgroup, url, zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19776</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19776"/>
		<updated>2017-08-04T21:21:29Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* File locations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
 Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
 E:\McNair\Projects\VC Database&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the line count for that table in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build from roundbase will inherit the dirty keys. Once the roundbase table is clean we will rename it roundbasecore so that we know it is safe to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
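As an alternative to exporting the table, the conflicting rows can be inspected directly in psql. This is a minimal sketch that assumes only the companybase1 table built above. Compare the columns side by side to find one that differs:&lt;br /&gt;
 --list every column for the duplicated key&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;
Running \x on first switches psql to expanded display, which makes wide rows easier to compare.&lt;br /&gt;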
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
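The DELETE statements above can also be generated rather than typed by hand. This is a sketch only, not the method used here. It assumes PostgreSQL's format() function (available from version 9.1) and the ipoduplicates2 table built earlier. Review the generated statements before running them:&lt;br /&gt;
 --one DELETE per duplicated (issuer, statecode) pair; %L quotes each value as an SQL literal&lt;br /&gt;
 SELECT DISTINCT format('DELETE FROM ipoinclude WHERE issuer = %L AND statecode = %L;', issuer, statecode)&lt;br /&gt;
 FROM ipoduplicates2;&lt;br /&gt;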
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
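The same check can be made in one statement so that both numbers appear side by side. This is a minimal sketch using only the mascore table. The column aliases totalrows and distinctkeys are illustrative names, not part of the database:&lt;br /&gt;
 --the two columns should be equal for a clean core table&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM mascore) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a) AS distinctkeys;&lt;br /&gt;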
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate, or when it is itself the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum announceddate for the targetname, targetstatecode pair by 7 days or less, the flag is 0.&lt;br /&gt;
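To see how the flag behaves at the boundaries, the same CASE expression can be run against a few literal dates. The dates below are made up for illustration and do not come from the mas data. The minimum announceddate is assumed to be 2010-01-01:&lt;br /&gt;
 --expected dateflag in date order: 1 (is the minimum), 0 (4 days later), 0 (exactly 7 days later), 1 (8 days later)&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d)&lt;br /&gt;
 ORDER BY d;&lt;br /&gt;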
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
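Because the flags were added by hand in Excel, it may be worth confirming after the import that they agree with the rules described above. This is a sketch that assumes the flag definitions are exactly as stated in this section. A result of 0 means every flag matches its rule:&lt;br /&gt;
 --count rows whose Excel flags disagree with the stated rules&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 != (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 != (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag != (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;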
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that have no warnings and where the announceddate of the merger/acquisition is later than the datefirstinv (neither earlier nor equal).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the matches with warnings as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table, but it doesn't (366 vs. 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
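To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, which counts the companies matched to both an ipo and an ma (only the earlier exit is kept for those). The total should equal the combined number of keys in your ipo and mas matcher core tables (matcherportcoipocore has 2350 rows in this build). In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;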
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
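The three counts used in this check can also be pulled in one pass from the flag table. This is only a sketch: it assumes the companybasekeysaddmaskeyaddipokeysmindateflag table built above is still in place, and the output column names (validmaskeys, validipokeys, bothexits) are made up for illustration. In the build documented here they should match the 8610, 2312 and 83 reported above.&lt;br /&gt;
 SELECT SUM(maskeyvalid) AS validmaskeys, --rows that go into companybasekeymaskeycore&lt;br /&gt;
 SUM(ipokeyvalid) AS validipokeys, --rows that go into companybasekeyipokeycore&lt;br /&gt;
 SUM(CASE WHEN ipoissuedate IS NOT NULL AND announceddate IS NOT NULL THEN 1 ELSE 0 END) AS bothexits --companies matched to both an ipo and an ma&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;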
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
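Since duplicate keys are the main risk here, it can also be worth confirming directly that each portco key appears only once in the master table. A minimal sketch, using only the columns created above (numrows is just an illustrative alias); an empty result means no key is duplicated.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*) AS numrows&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;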
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
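Because the flags are filled in by hand in Excel, a quick sanity check after the import can catch mistakes. The sketch below assumes only the rule described above (a row is kept when both exclude flags are 0, which should mean the issuedate is after datefirstinv). It should return 0 if the flags were filled in consistently.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0&lt;br /&gt;
 AND file2issuedate &amp;lt;= file1datefirstinv; --kept rows whose ipo is not after the first investment&lt;br /&gt;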
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 rows against 43651 distinct keys), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
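To see where the extra rows come from, one option is to list the portco keys that the matcher output maps to more than one geo key. This is a diagnostic sketch only, using the matcherportcogeo columns defined above (nummatches is an illustrative alias). Duplicated (coname, city, year) combinations in companybasecore itself could also contribute and would need a similar check.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;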
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are on the [[Geocode.py]] page. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate --city is needed later by deadalive1&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
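The tempbase query relies on an allyears table with a single year column, which is not built in this section. If it does not already exist in your database, one way to create it is sketched below. The year range is taken from deadalive1, which is an assumption and not necessarily the range used in the original build. Run it before the tempbase query.&lt;br /&gt;
 DROP TABLE IF EXISTS allyears;&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(b.firstyear, b.lastyear) AS year --one row per calendar year&lt;br /&gt;
 FROM (SELECT MIN(aliveyear)::int AS firstyear, MAX(deadyear)::int AS lastyear FROM deadalive1) AS b;&lt;br /&gt;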
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
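 --roundplus has one more row than roundcore (143000), so look for (coname, rounddate) keys that occur more than once&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;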
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
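As with the other core tables, you can confirm that the key is now unique. A minimal sketch; the count should equal the number of rows in roundcore.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;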
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks rounds that financed activities we are not interested in, so that they can be left out of our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
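A round whose stage3 value is not listed in any of the CASE expressions ends up with every flag set to 0. The query below is a sketch for spotting such values so you can decide whether they belong in one of the groups. It assumes nothing beyond the stageflags columns created above (numrounds is an illustrative alias).&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;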
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
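The flags in the next table were added outside the database. If you would rather compute the box flags in SQL, a sketch using the boundaries given above (longitude -125 to -66, latitude 24 to 50) could look like this. It covers only the longitude and latitude flags; the Hawaii and Puerto Rico flags are left out because their boxes are not given here, and rows with null coordinates get 0 for both flags.&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
 CASE WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude &amp;lt; 24 OR latitude &amp;gt; 50 THEN 1::int ELSE 0::int END AS latdirtyflag&lt;br /&gt;
 FROM geodirtydata AS g;&lt;br /&gt;
&lt;br /&gt;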
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378 (out of the 42678 rows in geoimport)&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this we will look up all the addresses in the db using the Geocode.py script. We also need to pull a company-level file from SDC, because in the round-level file the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates so that each company gets exactly one line; this avoids the normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
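Instead we chose to use only firmname as the key because there were not too many duplicates. For each duplicated firmname we keep the record with the lesser foundingdate: despite its name, firmbaseinclude holds the greater (MAX) foundingdate per duplicated name, and the anti-join that builds firmbasecore drops those rows.&lt;br /&gt;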
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
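As an alternative sketch (not the method used above), PostgreSQL's DISTINCT ON can keep one row per firmname in a single step. The table name firmbasecore_alt is hypothetical. The result can differ from the anti-join above: DISTINCT ON keeps exactly one row per firmname, NULL foundingdates sort last, and ties on foundingdate are broken arbitrarily, so compare the row count against 14133 before relying on it.&lt;br /&gt;
 --hypothetical one-step dedupe: keep the earliest foundingdate per firmname&lt;br /&gt;
 CREATE TABLE firmbasecore_alt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;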
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that (fundname, firstinvdate) is a valid key. However, we are going to use fundname alone as the key because it makes the later join operations simpler.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
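The counts above (47 duplicated names but only 44 rows in fundinclude2) suggest that some duplicated fund names have no single row carrying both the minimum closedate and the minimum lastinvdate, or have NULLs in those columns. A hedged diagnostic sketch to list the names that contributed no row:&lt;br /&gt;
 --duplicated fundnames that have no row in fundinclude2&lt;br /&gt;
 SELECT e.fundname&lt;br /&gt;
 FROM fundnameexclude AS e&lt;br /&gt;
 LEFT JOIN fundinclude2 AS i ON i.fundname = e.fundname&lt;br /&gt;
 WHERE i.fundname IS NULL;&lt;br /&gt;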
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
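A sketch of one way to list those firms, using an anti-join rather than a full join (the filter on 'Undisclosed Firm' mirrors the query above):&lt;br /&gt;
 --firm names that appear in fundbasecore but not in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;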
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
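A sketch for listing the portfolio company keys that dropped out of the join, assuming companybasecore is taken as the reference set:&lt;br /&gt;
 --companybasecore keys with no matching row in the joined table&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN fundbasefirmbaseroundlinegoodkeys AS g ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;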
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull in SDC to get the correct city and addresses, we need to update the companybasecore table, which means we need to clean the new companybase first. We then recreate the roundplus and roundlevel outputs.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19775</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19775"/>
		<updated>2017-08-04T21:20:54Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==File locations==&lt;br /&gt;
Database files are located here:&lt;br /&gt;
Z:\VentureCapitalData\SDCVCData\vcdb2&lt;br /&gt;
SDC files are located here and the normalized versions are copied into the Z folder above:&lt;br /&gt;
E:\McNair\Projects\VC Database&lt;br /&gt;
&lt;br /&gt;
==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So 1 key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in textpad for something unique about one of the entries for New York Digital Health LLC that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
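The conflicting rows can also be inspected directly in psql before deleting anything. A minimal sketch (\x on is the psql switch for expanded display, which makes wide rows easier to compare):&lt;br /&gt;
 --show every column for the duplicated key to spot the differing field&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;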
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries (1888 rows in the ipos table in total), look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create a ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now JOIN your ipos table with your ipoinclude table again and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I inserted an id column and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 --add the id to mas first so that the copy in mas1 carries the same id values&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table (114825), so the mascore table is clean.&lt;br /&gt;
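To see which keys are duplicated before discarding rows, a query along the following lines could be used. This is only a sketch that mirrors the GROUP BY ... HAVING check used for the ipos keys. It was not part of the original build and assumes the mas table as loaded above.&lt;br /&gt;
 --list each duplicated mas key with the number of rows that share it&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, COUNT(*)&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;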
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but by no more than 1 week, the dateflag is 0.&lt;br /&gt;
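As a small illustration of the flag logic, the sketch below applies the same CASE expression to four made-up dates, assuming a hypothetical minimum announceddate of 2010-01-01. None of these dates come from the mas data.&lt;br /&gt;
 --illustrative dates only; 2010-01-01 stands in for the minimum announceddate&lt;br /&gt;
 SELECT k.announceddate,&lt;br /&gt;
 CASE WHEN k.announceddate - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR k.announceddate = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-20')) AS k(announceddate);&lt;br /&gt;
 --2010-01-01 gets 1 (it is the minimum), 2010-01-05 and 2010-01-08 get 0 (within 7 days), 2010-01-20 gets 1&lt;br /&gt;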
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add one exclude flag that is set when the announceddate &amp;lt; datefirstinv and another exclude flag that is set when the datefirstinv = announceddate. Also add a warning flag that is set when the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
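The flags above were added in Excel. As an alternative, they could probably be computed in SQL instead. The sketch below assumes the raw matcher output has first been loaded into a hypothetical table matcherportcomasraw that has the same first seven columns as matcherportcomas but no flag columns. Which condition maps to excludeflag1 and which to excludeflag2 is also an assumption.&lt;br /&gt;
 --sketch only: matcherportcomasraw is a hypothetical table, not one created in this build&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2announceddate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2announceddate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw AS r;&lt;br /&gt;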
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is later than the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same number of rows as the matcherwarningmindates table, but it doesn't (366 vs. 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. Each portco key from the companybase table should match exactly 1 mas key from the mascore table. If any portco key matches more than one mas key, you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the sum to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
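The same comparison can be done in one query rather than by hand. This is a sketch that assumes the key tables built in this section and the two matcher core tables all exist. The column aliases keptplusoverlap and matchercoretotal are made up for illustration.&lt;br /&gt;
 --left column: kept mas keys + kept ipo keys + portcos that had both&lt;br /&gt;
 --right column: all rows in the two matcher core tables&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeyipokeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcoipocore) AS matchercoretotal;&lt;br /&gt;
 --with the counts reported above, both columns come to 11005&lt;br /&gt;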
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
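One more check that could be added (a sketch, not one of the checks run originally): because only the earliest exit is kept for each portco, no row should carry both an ipo date and a mas date. The query below should return 0 unless a portco has an ipo and an ma on the same date.&lt;br /&gt;
 --count master rows that carry both kinds of exit&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL AND masannounceddate IS NOT NULL;&lt;br /&gt;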
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If an exclude flag is 1, you want to exclude the entry from your table, because the issuedate is not later than the datefirstinv. If both exclude flags are 0, you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by selecting the rows where all the flags are 0, as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 extra rows whose keys are not distinct (43662 rows vs. 43651 distinct keys). We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
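To inspect the full rows behind these remaining duplicate keys, for example to compare their coordinates side by side, a join like the one below could be used. This is a sketch that was not part of the original build. It relies only on the geo1 table and its key columns, and rows with a NULL in any key column would not be returned.&lt;br /&gt;
 --show every geo1 row whose key appears more than once&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g INNER JOIN&lt;br /&gt;
 (SELECT city, coname, startyear FROM geo1 GROUP BY city, coname, startyear HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;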
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are many matching errors, as usual. If you simply import the output into vcdb2 and try joining companybasecore with geocore through it, the row counts will start to explode. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
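One way to see where the extra rows probably come from is to look for portco keys that appear more than once in the matcher output, since each repeat adds a row to the join. This is a sketch that was not run in the original build; it uses only the matcherportcogeo columns defined above.&lt;br /&gt;
 --portco keys that the matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;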
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future, the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
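The address export described above could look something like the sketch below. It assumes companybasecore carries the addr1, addr2, city and zip columns used elsewhere on this page; the table and file name geolookupaddresses are made up for illustration, and the column layout that Geocode.py expects should be checked against its own page.&lt;br /&gt;
 --sketch only: geolookupaddresses is a hypothetical name&lt;br /&gt;
 DROP TABLE IF EXISTS geolookupaddresses;&lt;br /&gt;
 CREATE TABLE geolookupaddresses AS&lt;br /&gt;
 --keep the full portco key so the coordinates can be joined back&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL;&lt;br /&gt;
 \COPY geolookupaddresses TO 'geolookupaddresses.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;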
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
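First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables, plus the companybasekeyipokeycore and companybasekeymaskeycore key tables used in the joins. Then calculate a deaddate: the IPO issue date if there is one, otherwise the acquisition announced date, and if there is no exit, datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;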
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that it can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
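The box test can also be written directly in SQL. The sketch below uses the continental US boundaries given above and flag names that match the geodirtydatawithflags table imported further down; the Hawaii and Puerto Rico boxes are rough assumptions, not values from the original work, and Alaska is not handled.&lt;br /&gt;
 --sketch only: Hawaii and Puerto Rico boxes are approximate assumptions&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
 CASE WHEN g.longitude NOT BETWEEN -125 AND -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN g.latitude NOT BETWEEN 24 AND 50 THEN 1::int ELSE 0::int END AS latdirtyflag,&lt;br /&gt;
 CASE WHEN g.latitude BETWEEN 18.5 AND 22.5 AND g.longitude BETWEEN -161 AND -154 THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 CASE WHEN g.latitude BETWEEN 17.8 AND 18.6 AND g.longitude BETWEEN -67.5 AND -65.2 THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport AS g;&lt;br /&gt;
A row is a candidate for re-geocoding when longdirtyflag or latdirtyflag is 1 and both hawaiiflag and prflag are 0. Rows with NULL coordinates get 0 on every flag here and need a separate check.&lt;br /&gt;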
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this we will simply look up all the addresses in the db using the Geocode.py script. We also need to pull a company-level file from SDC, because in the round-level pull the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that each company gets exactly one line and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
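Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate: firmbaseinclude collects the greater (MAX) foundingdate for each duplicated firmname, and the anti-join in the final query drops those rows.&lt;br /&gt;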
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
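As a cross-check, PostgreSQL's DISTINCT ON can pick one row per firmname in a single pass. This is an alternative sketch, not the method used above: it keeps exactly one row per name, the one with the earliest foundingdate (rows with a NULL foundingdate sort last), so its count can differ from firmbasecore if duplicated names share the same foundingdate. The table name firmbasecorecheck is hypothetical.&lt;br /&gt;
 --sketch only: firmbasecorecheck is a hypothetical name&lt;br /&gt;
 DROP TABLE IF EXISTS firmbasecorecheck;&lt;br /&gt;
 CREATE TABLE firmbasecorecheck AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 --within each firmname the earliest foundingdate comes first and is kept&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;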
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
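The plan is to grab all the duplicate fundnames and, for each one, include only the row with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table. Note that the join back to fundbase1 returns 44 rows for the 47 duplicated names, so at least three duplicated names have no row matching both minimums and drop out of fundbasecore.&lt;br /&gt;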
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
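Because fundinclude has 47 names but fundinclude2 has only 44 rows, it may be worth listing the duplicated fundnames that found no row matching both minimums. A minimal sketch using only the tables built above:&lt;br /&gt;
 --sketch: duplicated fundnames with no row in fundinclude2&lt;br /&gt;
 SELECT fu.fundname, fu.closedate, fu.lastinvdate&lt;br /&gt;
 FROM fundinclude AS fu&lt;br /&gt;
 LEFT JOIN fundinclude2 AS f ON f.fundname = fu.fundname&lt;br /&gt;
 WHERE f.fundname IS NULL;&lt;br /&gt;
A name whose rows all have a NULL closedate or lastinvdate, or whose two minimums sit on different rows, would show up as a miss here.&lt;br /&gt;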
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, including the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport into vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
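One way to list those firms is an anti-join from the funds side. This is a sketch of that check, not necessarily the query that produced the count above:&lt;br /&gt;
 --sketch: firmnames present in fundbasecore but missing from firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;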
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are many mismatches between funds and roundline: of the 19,677 distinct disclosed fundnames in roundline1, only 17,089 match a fundname in fundbasecore. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this.&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
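Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value in millions (exitvaluem). The exit value is the IPO principal amount when there is an IPO, otherwise the acquisition transaction amount. It is built off the companybaseipomasmaster table, so be sure you've built this first.&lt;br /&gt;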
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
Only 42,093 of the 44,740 portco keys in companybasecore survive the inner joins, so you can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
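To see which portco keys are missing, an anti-join against companybasecore works. A minimal sketch using the tables above:&lt;br /&gt;
 --sketch: portco keys in companybasecore that did not survive the inner joins&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS k&lt;br /&gt;
 ON k.coname = c.coname AND k.statecode = c.statecode AND k.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE k.coname IS NULL;&lt;br /&gt;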
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull from SDC to get the correct cities and addresses, we need to update the companybasecore table, which means cleaning the new companybase first. Then recreate the roundplus and round-level outputs from it.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,&lt;br /&gt;
 indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19774</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19774"/>
		<updated>2017-08-04T21:16:26Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by adding flags for Undisclosed Company and US location. Check whether the key (coname, statecode, datefirstinv) is valid. Remove duplicates manually with DELETE/UPDATE commands on roundbase. Check whether the round key is valid and remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count given in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
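A quick way to compare the loaded row counts against the Load files is to count each table after loading. This is a minimal sketch; it assumes LoadRoundbase.sql and LoadIPOs.sql create the roundbase and ipos tables used later on this page, and the other loaded tables can be added in the same way.&lt;br /&gt;
 --one row per table with its row count&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) AS numrows FROM roundbase&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT 'ipos' AS tablename, COUNT(*) AS numrows FROM ipos;&lt;br /&gt;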
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean, we rename it roundbasecore so that we know it is good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between keys and entries: every entry yields a unique key, and every (coname, statecode, datefirstinv) key identifies exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least 1 key matches more than one entry in the table.&lt;br /&gt;
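The two counts can also be returned side by side. A minimal sketch using the same subqueries as above (the column names totalkeys and distinctkeys are only illustrative):&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybase) AS totalkeys,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a) AS distinctkeys;&lt;br /&gt;
 --the key is valid only when totalkeys = distinctkeys&lt;br /&gt;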
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look at it in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to single it out for deletion. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
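As an alternative to exporting the table, the full rows behind a duplicated key can be listed inside psql before running the DELETE above. A sketch, assuming the companybase1 table and flags built earlier; keys with a NULL statecode would not be matched by the IN comparison:&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
 AND (coname, statecode, datefirstinv) IN (&lt;br /&gt;
  SELECT coname, statecode, datefirstinv&lt;br /&gt;
  FROM companybase1&lt;br /&gt;
  WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
  GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1)&lt;br /&gt;
 ORDER BY coname, statecode, datefirstinv;&lt;br /&gt;
Comparing the returned rows column by column shows which field (here the url) differs.&lt;br /&gt;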
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
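A different technique, not the one used above, is PostgreSQL's DISTINCT ON, which keeps exactly one row per key in a single statement. This is a sketch with a hypothetical table name ipocorealt. Unlike the manual DELETE statements, it keeps one row for keys whose highest principalamt is tied, and which of the tied rows survives is arbitrary:&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 --within each key the row with the largest principalamt sorts first and is kept&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;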
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 --mas1 is used here because the ids in masinclude were assigned on mas1&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate falls within 1 week after the minimum announceddate for the targetname, targetstatecode pair, the dateflag is 0.    &lt;br /&gt;
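To make the rule concrete, here is a small worked example on made-up dates with a minimum announceddate of 2010-03-01. It applies the same CASE expression to literal values and does not touch any table:&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-03-01' OR d = DATE '2010-03-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-03-01'), (DATE '2010-03-05'), (DATE '2010-03-08'), (DATE '2010-03-09')) AS v(d);&lt;br /&gt;
 --dateflag is 1, 0, 0, 1 for the four dates: only the minimum date and the date more than 7 days after it are kept&lt;br /&gt;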
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
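The three flags can also be derived in SQL instead of Excel. This is a sketch, assuming the unflagged matcher output had been loaded into a hypothetical table matcherportcomasraw holding the first seven columns above. Which condition maps to excludeflag1 and which to excludeflag2 is also an assumption:&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;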
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns only the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate for a merger/acquisition is strictly after the datefirstinv (announceddate &amp;gt; datefirstinv).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
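The query above only checks the portco side of the match. A sketch of the same check on the mas side, listing any mas key that is matched by more than one portco key:&lt;br /&gt;
 SELECT file2targetname, file2targetstatecode, file2announceddate, COUNT(*)&lt;br /&gt;
 FROM matcherportcomascore&lt;br /&gt;
 GROUP BY file2targetname, file2targetstatecode, file2announceddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
An empty result means every mas key is matched by at most one portco key.&lt;br /&gt;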
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check if you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. Each portco that matched both an ipo and a mas loses one of its two keys, so the totals should agree. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
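The reconciliation above assumes that no portco has an ipo and a mas on the same date. In that case both flags would be 1 and the portco would land in both key tables. A minimal sketch of a check for such ties, using only the flag table built above, is shown here; a result of 0 means there are none.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;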
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the exit dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
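A further check that could be run here (a sketch, not one of the checks from the original build) is to confirm that the portco key is still unique in the master table. If any of the joins fanned out, this query lists the offending keys; on a clean build it should return no rows.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;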
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude this entry from your table, since a valid exit should come after the first investment. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
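The three flags could also be derived in SQL rather than Excel. The sketch below assumes the unflagged Matcher output has been loaded into a table called matcherportcoiporaw (a hypothetical name) holding only the first seven columns of matcherportcoipo. It also assumes that excludeflag1 marks issuedate &amp;lt; datefirstinv and excludeflag2 marks issuedate = datefirstinv; the text above does not say which flag is which.&lt;br /&gt;
 --matcherportcoiporaw is an assumed table, not part of the original build&lt;br /&gt;
 SELECT m.*,&lt;br /&gt;
 CASE WHEN m.file2issuedate &amp;lt; m.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN m.file2issuedate = m.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN m.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS m;&lt;br /&gt;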
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows against 43651 keys), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
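To see the full rows behind the remaining duplicate keys, the grouped query can be joined back to geo1. This is a sketch that uses only the key columns shown above; the other columns come through g.*.&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
  GROUP BY city, coname, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;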
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
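One way to see where the extra rows come from (a sketch, not part of the original build) is to look for portco keys that the Matcher paired with more than one geo key. Each such key multiplies the matching companybasecore rows in the join above. The same grouping run on companybasecore over (coname, city, EXTRACT(YEAR FROM datefirstinv)) would show whether the portco side of the key is itself not unique.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;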
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
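A sketch of copying the addresses out with the portco key attached is shown below. The output file name is a placeholder, and the column layout that Geocode.py expects is not covered here, so adjust the column list to suit the script.&lt;br /&gt;
 --portcoaddresses.txt is a placeholder file name&lt;br /&gt;
 \COPY (SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip FROM companybasecore WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL) TO 'portcoaddresses.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;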
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
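The tempbase query reads from an allyears table with one row per calendar year, which is not built in this section. If it does not already exist in your database, a minimal sketch is to generate it from the range of years in deadalive1. The table and column names follow the tempbase query; the year range used here is an assumption.&lt;br /&gt;
 --assumed definition of allyears: one row per year from the first aliveyear to the last deadyear&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(&lt;br /&gt;
  (SELECT MIN(aliveyear)::int FROM deadalive1),&lt;br /&gt;
  (SELECT MAX(deadyear)::int FROM deadalive1)) AS year;&lt;br /&gt;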
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
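SelFlagBase has 32597 keys but only 32575 rows in companybasecore2 are flagged, which suggests that 22 keys from stageflags have no match in companybasecore. A sketch of a query to list them, using only the tables above:&lt;br /&gt;
 SELECT s.coname, s.statecode, s.datefirstinv&lt;br /&gt;
 FROM SelFlagBase AS s&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = s.coname AND c.statecode = s.statecode AND c.datefirstinv = s.datefirstinv&lt;br /&gt;
 WHERE c.coname IS NULL;&lt;br /&gt;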
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
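As with the other core tables, it is worth confirming that the key is now unique. This is a sketch in the same style as the earlier checks. The two counts should be equal; if they differ, look for rows with a NULL rounddate, which the NOT EXISTS comparison above does not catch.&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;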
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag is used to exclude companies that received financing for activities we are not interested in and that should therefore be left out of our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
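The flag columns appear to have been computed outside the database and then imported. As an alternative, the box test could be sketched directly in SQL. This assumes the coordinates to test are in geoimport and reuses the longdirtyflag and latdirtyflag column names from the import table; Hawaii and Puerto Rico would still need to be added back separately.&lt;br /&gt;
 --sketch only: 1 means the coordinate falls outside the continental US box (NULL coordinates are also flagged 1)&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
 CASE WHEN g.longitude BETWEEN -125 AND -66 THEN 0::int ELSE 1::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN g.latitude BETWEEN 24 AND 50 THEN 0::int ELSE 1::int END AS latdirtyflag&lt;br /&gt;
 FROM geoimport AS g;&lt;br /&gt;
&lt;br /&gt;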
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this we will simply look up all the addresses in the db using the Geocode.py script. We also need to pull a company-level file from SDC, because in the round-level file the normalizer either copies the addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates so that each company gets exactly one line; this avoids normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
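The copy-out step is not recorded here. A minimal sketch of it, assuming the company-level pull was loaded into the sdccompanybase table used later on this page, and using hypothetical names for the output table and file:&lt;br /&gt;
 --sketch: keys with an address, to be passed to Geocode.py (geolookupkeys and geolookupkeys.txt are hypothetical names)&lt;br /&gt;
 CREATE TABLE geolookupkeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM sdccompanybase&lt;br /&gt;
 WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL;&lt;br /&gt;
 \COPY geolookupkeys TO 'geolookupkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
If a key appears with more than one address, this returns one row per distinct address, so the key count should be checked before geocoding.&lt;br /&gt;
&lt;br /&gt;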
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
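Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate: firmbaseinclude holds the MAX(foundingdate) for each duplicated firmname, and the final query drops the rows that match it.&lt;br /&gt;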
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
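PostgreSQL's DISTINCT ON offers a more compact way to keep one row per firmname. This is only a sketch of an alternative, not the query that was run, and firmbasecore_alt is a hypothetical table name. It keeps the earliest foundingdate per firmname and always returns exactly one row per name, even when a name appears three or more times or when two rows tie on foundingdate (the tie is then broken arbitrarily).&lt;br /&gt;
 --sketch: one row per firmname, earliest foundingdate first (NULL dates sort last in ascending order)&lt;br /&gt;
 CREATE TABLE firmbasecore_alt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;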
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to take every duplicated fundname and keep in the fundbasecore table only the row that has both the MIN(closedate) and the MIN(lastinvdate) for that fundname.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
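The counts above show 47 duplicated fundnames but only 44 rows in fundinclude2, so some names have no single row that carries both minimum dates, and others may match more than once. The following checks are a sketch for finding those cases; they only use tables built above.&lt;br /&gt;
 --duplicated fundnames that did not make it into fundinclude2 (no row has both minimums, or a date is NULL)&lt;br /&gt;
 SELECT e.fundname&lt;br /&gt;
 FROM fundnameexclude AS e&lt;br /&gt;
 LEFT JOIN fundinclude2 AS i ON i.fundname = e.fundname&lt;br /&gt;
 WHERE i.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 --fundnames that are still duplicated in fundbasecore&lt;br /&gt;
 SELECT fundname, COUNT(*)&lt;br /&gt;
 FROM fundbasecore&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;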
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
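The full join itself is not shown. One way to write it, as a sketch, is to join the distinct firmnames from both tables and keep only the rows where one side is missing:&lt;br /&gt;
 --sketch: firmnames present in only one of the two tables&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname&lt;br /&gt;
 FROM (SELECT DISTINCT firmname FROM firmbasecore) AS f&lt;br /&gt;
 FULL OUTER JOIN (SELECT DISTINCT firmname FROM fundbasecore WHERE firmname != 'Undisclosed Firm') AS fu&lt;br /&gt;
 ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL OR fu.firmname IS NULL;&lt;br /&gt;
Rows with a NULL firmfirmname are firms that appear in the funds table but not in the firms table.&lt;br /&gt;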
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are a lot of mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to change the RoundOnOneLine.pl and Normalize.pl scripts to fix this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys.&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
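To see which portcokeys drop out, an anti-join against companybasecore can be used. This is a sketch that only relies on the tables and key columns above:&lt;br /&gt;
 --sketch: portcokeys in companybasecore with no matching fund and firm&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;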
&lt;br /&gt;
==Redoing the companybase with the new SDC company data==&lt;br /&gt;
Since we did another pull from SDC to get the correct city and addresses, we need to update the companybasecore table, which means we need to clean the new companybase. We then recreate the roundplus and roundlevel outputs from it.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44996&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase)a;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybase1;&lt;br /&gt;
 CREATE TABLE sdccompanybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM sdccompanybase;&lt;br /&gt;
 --44997&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE sdccompanybasecore;&lt;br /&gt;
 CREATE TABLE sdccompanybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,lastupdated,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,&lt;br /&gt;
 indclass,indsubgroup,indminorgroup,url,zip&lt;br /&gt;
 FROM sdccompanybase1 WHERE nationcode = 'US' AND undisclosedflag=0;&lt;br /&gt;
 --44966&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM sdccompanybasecore)a;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN sdccompanybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --142999&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22374&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22374&lt;br /&gt;
 \COPY roundleveloutput2 TO 'roundleveloutput2.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19772</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19772"/>
		<updated>2017-08-04T21:02:15Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining funds with roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the line count given in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
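A quick way to get the row counts for this check, as a sketch. It assumes the load scripts create tables named roundbase, ipos, firmbase and fundbase, as the plan above suggests; add the remaining tables in the same way.&lt;br /&gt;
 --sketch: row counts of the freshly loaded tables&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) FROM roundbase&lt;br /&gt;
 UNION ALL SELECT 'ipos', COUNT(*) FROM ipos&lt;br /&gt;
 UNION ALL SELECT 'firmbase', COUNT(*) FROM firmbase&lt;br /&gt;
 UNION ALL SELECT 'fundbase', COUNT(*) FROM fundbase;&lt;br /&gt;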
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some rows in the companybase table have undisclosed company names or belong to companies outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the entries for New York Digital Health LLC that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
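The duplicate rows can also be compared inside psql before deciding which one to delete. A small sketch, using the key found above:&lt;br /&gt;
 --sketch: show every column of the rows that share the duplicate key&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = '2015-08-13';&lt;br /&gt;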
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which add up to 1888 rows in the ipos table, we look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamts. For each key, let's keep the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
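As an alternative to writing each DELETE by hand, the same cleanup can be sketched as a single statement driven by the ipoduplicates2 table. This is only a sketch: unlike the statements above it also matches on issuedate, so the deleted row counts may differ from those recorded here.&lt;br /&gt;
 --sketch: remove every key listed in ipoduplicates2 from ipoinclude&lt;br /&gt;
 DELETE FROM ipoinclude AS i&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE i.issuer = d.issuer AND i.issuedate = d.issuedate AND i.statecode = d.statecode;&lt;br /&gt;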
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column to mas and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 --add the id to mas first so that mas1 copies the same ids&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the row count of the mascore table, so the mascore table is clean.&lt;br /&gt;
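PostgreSQL's DISTINCT ON can express the same &amp;quot;keep the lowest id per key&amp;quot; rule in one query. A minimal sketch, assuming the id column added to mas above; the table name mascorealt is hypothetical and is meant only for comparison with mascore:&lt;br /&gt;
 --sketch: one row per key, keeping the row with the smallest id&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;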
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher on them will still yield many errors, so we will filter the mas keys some more. First, remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within one week after the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than one week after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by one week or less, the dateflag is 0.&lt;br /&gt;
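To see how the flag behaves, the same CASE expression can be run on a few made-up dates against an assumed minimum announceddate of 2010-01-01 (all values here are hypothetical):&lt;br /&gt;
 SELECT t.announceddate,&lt;br /&gt;
 CASE WHEN t.announceddate - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR t.announceddate = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(announceddate)&lt;br /&gt;
 ORDER BY t.announceddate;&lt;br /&gt;
 --dateflag is 1, 0, 0, 1: a date exactly 7 days after the minimum is still dropped&lt;br /&gt;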
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag that is 1 if the announceddate &amp;lt; datefirstinv and a second exclude flag that is 1 if the datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
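The Excel flags can be cross-checked in SQL after the import. This sketch counts rows whose imported flags disagree with flags recomputed from the dates and the warning column. It assumes the warning text is exactly 'Hall-Warning:Multiple'; a result of 0 would mean the Excel flags are consistent.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 &amp;lt;&amp;gt; (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 &amp;lt;&amp;gt; (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag &amp;lt;&amp;gt; (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;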
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns only the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is later than the datefirstinv (not earlier and not equal).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join row count should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
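In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression above could likely be shortened to LEAST(announceddate, ipoissuedate). A small sketch with made-up dates:&lt;br /&gt;
 SELECT LEAST(DATE '2005-03-01', DATE '2004-07-15') AS bothdates,&lt;br /&gt;
 LEAST(DATE '2005-03-01', NULL) AS maonly,&lt;br /&gt;
 LEAST(NULL::date, NULL::date) AS neither;&lt;br /&gt;
 --2004-07-15, 2005-03-01, NULL&lt;br /&gt;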
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
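To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. The 83 portcokeys matched to both an ipokey and a maskey each lose one of their two matches, which accounts for the difference. In my case I get 8610 + 2312 + 83 = 8655 + 2350 = 11005.&lt;br /&gt;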
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
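The same reconciliation can be sketched as a single query so the two sides can be compared directly. It assumes no portco has an announceddate equal to its ipoissuedate, since such a row would be flagged valid in both core tables.&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore) + (SELECT COUNT(*) FROM matcherportcoipocore) AS matched;&lt;br /&gt;
 --with the counts recorded above both columns come to 11005&lt;br /&gt;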
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
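One further check, sketched here as a suggestion, is that the joins did not duplicate any portco key. If the master table is clean, the distinct key count should equal the 44740 rows reported above.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;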
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 vs 43651), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
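To see which keys cause the extra rows, it can help to list the portco keys that the Matcher paired with more than one geo key. This is only a diagnostic sketch using the matcherportcogeo columns defined above; it assumes the duplication comes from the Matcher output rather than from companybasecore itself.&lt;br /&gt;
 --portco keys that appear more than once in the matcher output&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;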
We need to fix this. Luckily, I already had this data in another database, so I copied it out and imported it into vcdb2. The raw data is in a text file called geolookupold.txt in the folder on the Z drive.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data and traced back to a dirty file, GeocodedVCData, that we initially used for geo coordinates. In the future, the safest way to get geo coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
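As a sketch of the export step, the query below copies out one address line per portco key. The table name addressesforgeocode and the file name are made up for illustration, and the column layout that Geocode.py expects is an assumption, so adjust the column list to match the script.&lt;br /&gt;
 --hypothetical table name; one row per portco key that has an address&lt;br /&gt;
 CREATE TABLE addressesforgeocode AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 WHERE addr1 IS NOT NULL OR addr2 IS NOT NULL;&lt;br /&gt;
 \COPY addressesforgeocode TO 'addressesforgeocode.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;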
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: the IPO issue date if there is one, otherwise the acquisition announced date, and if there is no exit at all, datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
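The tempbase query above joins against an allyears table that is not created on this page. If it is missing, something like the following would build it, assuming it is just a one-column table of calendar years (column year) spanning the first aliveyear to the last deadyear in deadalive1. Run it before creating tempbase.&lt;br /&gt;
 --assumed definition of allyears: one row per calendar year&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(b.firstyear, b.lastyear) AS year&lt;br /&gt;
 FROM (SELECT MIN(aliveyear)::int AS firstyear, MAX(deadyear)::int AS lastyear FROM deadalive1) AS b;&lt;br /&gt;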
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that it can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
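A quick way to check that the flags cover every stage3 value is to list the values that set none of growthflag, transactionflag or excludeflag. This is a sanity check added as a sketch; any stage3 value it returns (including NULL) is currently not classified.&lt;br /&gt;
 --stage3 values not covered by any flag&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;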
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
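The flag columns (longdirtyflag, latdirtyflag, hawaiiflag, prflag) are imported from a text file further down, but the same box test could be sketched directly in SQL. The continental bounds are the ones quoted above; the Hawaii and Puerto Rico boxes are rough values assumed for illustration, and the table name geoboxflags is made up.&lt;br /&gt;
 --hypothetical table; flag points outside the continental US box&lt;br /&gt;
 CREATE TABLE geoboxflags AS&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
 CASE WHEN longitude BETWEEN -125 AND -66 THEN 0::int ELSE 1::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude BETWEEN 24 AND 50 THEN 0::int ELSE 1::int END AS latdirtyflag,&lt;br /&gt;
 --assumed rough box for Hawaii&lt;br /&gt;
 CASE WHEN latitude BETWEEN 18.5 AND 22.5 AND longitude BETWEEN -161 AND -154 THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 --assumed rough box for Puerto Rico&lt;br /&gt;
 CASE WHEN latitude BETWEEN 17.5 AND 18.7 AND longitude BETWEEN -67.5 AND -65 THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport AS g;&lt;br /&gt;
 &lt;br /&gt;
 --keys to send back through Geocode.py&lt;br /&gt;
 SELECT coname, statecode, datefirstinv FROM geoboxflags&lt;br /&gt;
 WHERE (longdirtyflag = 1 OR latdirtyflag = 1) AND hawaiiflag = 0 AND prflag = 0;&lt;br /&gt;
Rows with NULL coordinates fall into the ELSE branches of the first two flags, so they are flagged as dirty as well.&lt;br /&gt;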
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this we will simply look up all the addresses in the db using the Geocode.py script. We also need to pull a company-level file from SDC, because in the round-level pull the normalizer copies addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that each company gets exactly one line and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv keys that have a value in addr1 or addr2. Run these through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
  LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = companybasecore2.datefirstinv  &lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
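Instead we chose to use only firmname as the key because there were not too many duplicates. For each duplicated firmname we keep the row with the lesser foundingdate: firmbaseinclude, despite its name, collects the greater (MAX) foundingdate for each duplicate, and the anti-join that rebuilds firmbasecore drops those rows.&lt;br /&gt;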
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
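As an alternative sketch (not what was run above), PostgreSQL's DISTINCT ON can pick one row per fundname in a single step, keeping the row with the earliest closedate and, within ties, the earliest lastinvdate. Unlike the MIN-based join, it always returns exactly one row per fundname, so the row count may differ from the 27047 above. The table name fundbasecorealt is made up.&lt;br /&gt;
 --hypothetical alternative: one row per fundname&lt;br /&gt;
 CREATE TABLE fundbasecorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (fundname) *&lt;br /&gt;
 FROM fundbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY fundname, closedate, lastinvdate;&lt;br /&gt;
In ascending order PostgreSQL sorts NULLs last, so a row with a known closedate is preferred over one without.&lt;br /&gt;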
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firms and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
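The firms that appear in the funds table but not in the firms table can be listed with an anti-join. This is a sketch using the same tables and the same 'Undisclosed Firm' filter as the test join above.&lt;br /&gt;
 --firmnames in fundbasecore with no match in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;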
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
There are a lot of mismatches between funds and roundline. After investigation, it seems that some of the fundnames were altered in the roundline data. We will need to fix the RoundOnOneLine.pl and Normalize.pl scripts to resolve this issue.  &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM roundline1 WHERE undisclosedflag=0;&lt;br /&gt;
 --19677&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname)) AS t;&lt;br /&gt;
 --17089&lt;br /&gt;
&lt;br /&gt;
 --Look at the ones that don't match&lt;br /&gt;
 SELECT a.fundname, b.fundname FROM (SELECT DISTINCT fundname FROM &lt;br /&gt;
 roundline1 WHERE undisclosedflag=0) AS A LEFT JOIN (select distinct fundname FROM fundbasecore) AS B ON a.fundname=b.fundname WHERE &lt;br /&gt;
 b.fundname IS NULL;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --27044&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.fundname, b.firmname FROM (SELECT DISTINCT fundname, firmname FROM &lt;br /&gt;
 fundbasecore) AS A JOIN (select distinct firmname FROM firmbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --26910&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasecore;&lt;br /&gt;
 --14103&lt;br /&gt;
&lt;br /&gt;
 SELECT count(*) FROM ((SELECT a.firmname, b.firmname FROM (SELECT DISTINCT firmname FROM  &lt;br /&gt;
 firmbasecore) AS A JOIN (select distinct firmname FROM fundbasecore) AS B ON a.firmname=b.firmname)) AS t;&lt;br /&gt;
 --14084&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
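To see which portco keys dropped out of the join, an anti-join against companybasecore can be used. This is a sketch; it only lists the missing keys and does not say whether the roundline, fund or firm join lost them.&lt;br /&gt;
 --portco keys in companybasecore that did not survive the join&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;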
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19748</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19748"/>
		<updated>2017-08-04T16:28:18Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Fixing erroneous geo-coordinates */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2, then load the following tables by running the commands below. Make sure the sql scripts and the data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what needs cleaning, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, dirty keys will infect every other table that you build from roundbase. Once the roundbase table is clean we rename it roundbasecore so that we know it is good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, meaning there is a 1:1 relationship between keys and entries: given an entry you can build its key, and given a (coname, statecode, datefirstinv) key you can find exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
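As an alternative to exporting the table, the conflicting rows can be inspected directly in psql. The query below is only a sketch built from the tables above; the alias dup is an illustrative name, and the equality join would miss keys with a NULL statecode or datefirstinv.&lt;br /&gt;
 --show every column of the rows whose key appears more than once&lt;br /&gt;
 SELECT c.*&lt;br /&gt;
 FROM companybase1 AS c INNER JOIN (&lt;br /&gt;
  SELECT coname, statecode, datefirstinv&lt;br /&gt;
  FROM companybase1&lt;br /&gt;
  WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
  GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ) AS dup ON c.coname = dup.coname AND c.statecode = dup.statecode AND c.datefirstinv = dup.datefirstinv&lt;br /&gt;
 ORDER BY c.coname;&lt;br /&gt;
Comparing the returned rows column by column should reveal a field, such as url here, that distinguishes the record to delete.&lt;br /&gt;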
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
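To see how many rows of ipos are involved in these duplicate keys, the count column that the query above adds to ipoduplicates can be summed. This is a small sketch; it assumes the column kept the default name count that PostgreSQL gives an unnamed COUNT(*), and duplicaterows is an illustrative alias.&lt;br /&gt;
 --total number of ipos rows that share a key with at least one other row&lt;br /&gt;
 SELECT SUM(count) AS duplicaterows&lt;br /&gt;
 FROM ipoduplicates;&lt;br /&gt;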
In the file you will notice that many duplicate keys have different principalamt values. For each key, let's keep the row with the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
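Rather than writing each DELETE by hand, the remaining duplicate keys could be removed in one statement by joining against ipoduplicates2. This is a sketch rather than the method used above, and it is narrower: it also matches on issuedate, whereas the manual statements remove every ipoinclude key for the issuer and statecode.&lt;br /&gt;
 --delete from ipoinclude every key that is still duplicated in ipocore&lt;br /&gt;
 DELETE FROM ipoinclude AS i&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE i.issuer = d.issuer AND i.issuedate = d.issuedate AND i.statecode = d.statecode;&lt;br /&gt;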
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I added an id column and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --number the rows in the copy only; ids generated separately on mas are not guaranteed to line up with mas1&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
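PostgreSQL's DISTINCT ON offers a more compact way to keep one row per key. The sketch below is an alternative to the masinclude join, not the method used above; mascorealt is a hypothetical table name, and it relies on the id column added to mas1.&lt;br /&gt;
 --keep the row with the lowest id for each (targetname, targetstatecode, announceddate) key&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;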
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones in that window, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
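To see how the flag behaves, the same CASE expression can be run on a few literal dates. The dates below are made up for illustration, with 2010-03-01 standing in for the minimum announceddate of a targetname, targetstatecode pair.&lt;br /&gt;
 SELECT d.announceddate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; DATE '2010-03-01' OR d.announceddate = DATE '2010-03-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-03-01'), (DATE '2010-03-05'), (DATE '2010-03-08'), (DATE '2010-03-09')) AS d(announceddate);&lt;br /&gt;
 --2010-03-01 is the minimum itself, so dateflag = 1&lt;br /&gt;
 --2010-03-05 and 2010-03-08 are within 7 days of the minimum, so dateflag = 0&lt;br /&gt;
 --2010-03-09 is more than 7 days after the minimum, so dateflag = 1&lt;br /&gt;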
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where the announceddate &amp;lt; datefirstinv and another exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
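The three flags could also be computed in SQL instead of Excel. The sketch below assumes the raw matcher output has first been loaded into a hypothetical table matcherportcomasraw holding only the first seven columns of matcherportcomas; matcherportcomasflagged is likewise an illustrative name, and the order of the two exclude flags is assumed from the description above.&lt;br /&gt;
 CREATE TABLE matcherportcomasflagged AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 --exclude matches where the merger was announced before the first investment&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 --exclude matches where the merger was announced on the day of the first investment&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 --mark matches that the matcher reported as ambiguous&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;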
You've imported 9,645 matches into your matcher table in vcdb2. The query below gives the number of &amp;quot;good&amp;quot; matches: those that do not carry a warning, where the merger/acquisition announceddate is not earlier than the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join row count should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs. 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the matcherportcomas rows that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables, meaning that 1 key matches exactly 1 row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Finally we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined number of mas and ipo matches in your matcher core tables (matcherportcomascore and matcherportcoipocore). In my case I get 8610 + 2312 + 83 = 8655 + 2350 = 11005.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
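The same comparison can be made in a single query. This is a sketch that assumes the tables built above, including matcherportcoipocore from the ipo matching step; the column names keptplusboth and matched are illustrative.&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusboth,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM matcherportcoipocore) AS matched;&lt;br /&gt;
 --the two columns should be equal, unless an ipo and an ma share the same date, in which case both keys are kept&lt;br /&gt;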
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add two exclude flags: one that is 1 where issuedate &amp;lt; datefirstinv and one that is 1 where issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0 (i.e. issuedate &amp;gt; datefirstinv), keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar(2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
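The exclude and warning flags described at the start of this section are added in Excel, but the same rules can be scripted. The sketch below is one possible way to do it in Python. The names add_flags and is_good_match are hypothetical, and the assignment of excludeflag1 to the issued-before-first-investment case and excludeflag2 to the same-day case is an assumption, since this page does not say which flag is which.&lt;br /&gt;

```python
from datetime import date

# Warning text quoted on this page for rows with multiple matches.
MULTIPLE_WARNING = "Hall-Warning:Multiple"


def add_flags(warning, datefirstinv, issuedate):
    """Return (excludeflag1, excludeflag2, warningflag) for one matched row.

    Assumed mapping (not stated on the page):
      excludeflag1 = 1 when the IPO predates the first investment
      excludeflag2 = 1 when the IPO falls on the first investment date
    """
    excludeflag1 = 1 if datefirstinv > issuedate else 0
    excludeflag2 = 1 if issuedate == datefirstinv else 0
    warningflag = 1 if warning == MULTIPLE_WARNING else 0
    return excludeflag1, excludeflag2, warningflag


def is_good_match(flags):
    """A row is a clean match only when every flag is 0."""
    return all(flag == 0 for flag in flags)
```

A row for which is_good_match returns True corresponds to the rows selected with excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0.&lt;br /&gt;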
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread over 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the latitude and longitude of each of these rows and delete one row from each pair. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are at [[Geocode.py]]. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
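First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: if the company had an ipo the deaddate is the issuedate, otherwise if it had an ma the deaddate is the announceddate, and if there is no exit the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;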
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
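The deaddate rule in the CASE expression above can also be written out as a small function, which makes the order of precedence easy to check on a few example dates. This is only an illustrative Python sketch. The names add_years and dead_date are hypothetical, and clamping 29 February to 28 February is my assumption about how the five-year shift should behave (I believe PostgreSQL interval addition does the same, but check this against your own database).&lt;br /&gt;

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, clamping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)


def dead_date(datelastinv, issuedate=None, announceddate=None):
    """Mirror the CASE expression: IPO date first, then M and A date, else +5 years."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    return add_years(datelastinv, 5)
```

Note that when a company has both an ipo and an ma, the issuedate wins, because that is the order in which the CASE branches are evaluated.&lt;br /&gt;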
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
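The tempbase and alivecount queries above expand each company into one row per year between its first investment year and its deadyear, and then count companies per city, state and year. The Python sketch below illustrates that logic on plain tuples. The function name alive_counts and the input layout are hypothetical, and the sketch assumes the allyears table covers every year in each company's range.&lt;br /&gt;

```python
from collections import Counter


def alive_counts(companies):
    """Count companies alive per (city, statecode, year).

    companies: iterable of (coname, city, statecode, aliveyear, deadyear),
    with both years as integers and both ends of the range inclusive.
    """
    seen = set()
    for coname, city, statecode, aliveyear, deadyear in companies:
        for year in range(aliveyear, deadyear + 1):
            # The set plays the role of SELECT DISTINCT in tempbase.
            seen.add((year, coname, city, statecode))
    return Counter((city, statecode, year) for year, _, city, statecode in seen)
```

For example, a company alive from 2000 to 2002 and another alive only in 2001, both in the same city, give a count of 2 for 2001 and 1 for 2000 and 2002.&lt;br /&gt;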
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The table is built off the round table. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks rounds whose financing type we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that they can be excluded from our dataset.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout', 'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
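The box test itself can be scripted rather than done by hand. The Python sketch below uses the continental boundaries quoted above. The Hawaii and Puerto Rico boxes are rough illustrative values that I have assumed for the add-back step (they are not from this page), and the names in_box and is_dirty are hypothetical.&lt;br /&gt;

```python
# Continental US box, using the boundaries quoted on this page.
CONUS = {"lat": (24.0, 50.0), "lon": (-125.0, -66.0)}
# Rough, illustrative boxes for the add-back step (assumed, not from the page).
HAWAII = {"lat": (18.5, 22.5), "lon": (-160.5, -154.5)}
PUERTO_RICO = {"lat": (17.8, 18.6), "lon": (-67.5, -65.2)}


def in_box(lat, lon, box):
    """True when (lat, lon) lies inside the box, edges included."""
    lat_lo, lat_hi = box["lat"]
    lon_lo, lon_hi = box["lon"]
    return lat_hi >= lat >= lat_lo and lon_hi >= lon >= lon_lo


def is_dirty(lat, lon):
    """Flag a point outside the continental box that is not in Hawaii or Puerto Rico."""
    if in_box(lat, lon, CONUS):
        return False
    return not (in_box(lat, lon, HAWAII) or in_box(lat, lon, PUERTO_RICO))
```

With these boxes the coordinates (44.933143, 7.540121), which appear in one of the DELETE statements in the geo cleaning section, are flagged as dirty because the longitude falls outside the box. Alaska is not handled here because the page only mentions Hawaii and Puerto Rico.&lt;br /&gt;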
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: addresses within the US have incorrect geo-coordinates. To fix this problem we will just look up all the addresses in the db using the Geocode.py script. We also need to pull a company level file from SDC, because otherwise the normalizer copies the addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that each company is assigned only one line and there are no normalization errors. Then copy the file into the db and copy out all the distinct coname, statecode, datefirstinv that have a value in addr1 or addr2. Run this through the geocode script, copy the result back into the db and redo the colevel output tables.&lt;br /&gt;
 DROP TABLE geoallcoords;&lt;br /&gt;
 CREATE TABLE geoallcoords (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoallcoords FROM 'sdccompanygeolookup.txt_coords' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --44999&lt;br /&gt;
&lt;br /&gt;
 --redo the colevel output tables&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoallcoords AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31523&lt;br /&gt;
 \COPY colevelsimple TO 'colevelsimple.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
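Instead we chose to use only firmname as the key because there were few duplicates. For each duplicated firmname we keep the row with the earlier foundingdate. Note that, despite its name, firmbaseinclude holds the later (MAX) foundingdate for each duplicate, and the final anti-join drops those rows.&lt;br /&gt;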
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
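As an alternative sketch (not the method used above), PostgreSQL's DISTINCT ON can keep one row per firmname in a single statement. The table name firmbasecorealt is hypothetical. Unlike the anti-join above, this version keeps exactly one row per name even when two duplicates share the same foundingdate:&lt;br /&gt;
 CREATE TABLE firmbasecorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 --earliest foundingdate first; NULL dates sort last in ascending order&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;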
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The matching counts show that (fundname, firstinvdate) is a valid key. However, we will use fundname alone as the key because it makes later joins easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
fundname alone is not yet unique (27,050 distinct names in 27,097 rows). The plan is to collect all duplicated fundnames and, for each, keep only the row that has both the MIN(closedate) and the MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
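The counts suggest that some duplicated fund names are lost entirely: fundnameexclude has 47 names but fundinclude2 has only 44 rows, which can happen when no single row carries both minimum dates or when one of the dates is NULL. A sketch of a query that lists the duplicated names with no surviving row:&lt;br /&gt;
 SELECT e.fundname&lt;br /&gt;
 FROM fundnameexclude AS e&lt;br /&gt;
 LEFT JOIN fundinclude2 AS i ON i.fundname = e.fundname&lt;br /&gt;
 WHERE i.fundname IS NULL;&lt;br /&gt;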
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the firm keys and the fund keys, including the firmname column from the fundbasecore table with the fund keys. Run the two files through the Matcher, manually flag the multiple matches (there are only ~50 of them), then reimport the result into vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
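A sketch of one way to list those firms, using an anti-join from the funds table rather than a full join:&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;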
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portco key (coname, statecode, datefirstinv) with an ipoflag, a maflag and an exitflag, plus an exit value (exitvaluem, the amount column divided by 1000). It is built from the companybaseipomasmaster table, so build that table first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
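A quick sanity check on the flags, offered as a sketch: since exitflag is set whenever either of the other flags is set, exits can never exceed ipos plus mas. The bothflags column counts the portcos that have both an IPO and a merger/acquisition record (for those, exitvaluem comes from the IPO amount):&lt;br /&gt;
 SELECT COUNT(*) AS portcos, SUM(ipoflag) AS ipos, SUM(maflag) AS mas, SUM(exitflag) AS exits,&lt;br /&gt;
 SUM(CASE WHEN ipoflag = 1 AND maflag = 1 THEN 1 ELSE 0 END) AS bothflags&lt;br /&gt;
 FROM portcoexitmaster;&lt;br /&gt;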
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that many keys do not exist in the other datasets: only 8,956 of the 14,133 firms in firmbasecore and 16,907 of the 27,047 funds in fundbasecore survive the joins.&lt;br /&gt;
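To see which portco keys are missing, a sketch of an anti-join from companybasecore to the joined table:&lt;br /&gt;
 --a key with a NULL statecode never matches on equality, so it is always listed here&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN fundbasefirmbaseroundlinegoodkeys AS g ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;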
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19744</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19744"/>
		<updated>2017-08-04T15:40:59Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Fixing erroneous geo-coordinates */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2, then load the following tables by running the commands below. Make sure the SQL scripts and the data txt files are all located in that folder. Check that the number of rows copied into each new table matches the count recorded in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
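Once the scripts have run, the row counts can be read back in one query and compared with the Load files. This is a sketch that assumes the scripts create tables named fundbase, ipos, roundbase and firmbase; adjust the names if your scripts differ:&lt;br /&gt;
 SELECT 'fundbase' AS tablename, COUNT(*) FROM fundbase&lt;br /&gt;
 UNION ALL SELECT 'ipos', COUNT(*) FROM ipos&lt;br /&gt;
 UNION ALL SELECT 'roundbase', COUNT(*) FROM roundbase&lt;br /&gt;
 UNION ALL SELECT 'firmbase', COUNT(*) FROM firmbase;&lt;br /&gt;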
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains data that we want to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build the companybase table first. The companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same fixes to the roundbase table, because cleaning done downstream should be pushed upstream into the base table; otherwise anything else built from roundbase will inherit the dirty keys. Once the roundbase table is clean we rename it roundbasecore to mark it as safe for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match: there are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry.&lt;br /&gt;
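The two counts can also be produced side by side in one query, which makes the comparison harder to misread. A minimal sketch of the same check (the key is valid when allrows equals distinctkeys):&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybase) AS allrows,&lt;br /&gt;
 (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a) AS distinctkeys;&lt;br /&gt;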
Some rows in the companybase table have an undisclosed company name, and some companies are located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something that distinguishes the two entries for New York Digital Health LLC, so that one of them can be deleted manually. It turns out the url differs, so we'll use that.&lt;br /&gt;
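The two conflicting rows can also be inspected directly in psql instead of exporting the whole table. A sketch, using psql's \x command to switch to expanded display so that the differing column is easy to spot:&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT * FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;
 \x off&lt;br /&gt;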
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicates; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which account for 1,888 rows in the ipos table, we look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
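An alternative sketch, not the method used above: DISTINCT ON keeps exactly one row per key, the one with the largest principalamt. Keys whose duplicates tie on principalamt are then kept (one arbitrary row of the tie) rather than deleted, so the row count may differ from ipocore. The table name ipocorealt is hypothetical:&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;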
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column to a copy of the table (mas1) and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --add the id to the copy only: ids generated separately on mas and mas1 are not guaranteed to line up&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode: keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 7 days after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate falls within 7 days after the minimum for the pair, the flag is 0.&lt;br /&gt;
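A small self-contained example of the flag logic, using made-up dates and a fixed minimum of 2010-01-01 in place of maskeysmindates.announceddate:&lt;br /&gt;
 SELECT d::date AS announceddate,&lt;br /&gt;
 CASE WHEN d::date - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d::date = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES ('2010-01-01'), ('2010-01-05'), ('2010-01-08'), ('2010-01-09')) AS v(d)&lt;br /&gt;
 ORDER BY announceddate;&lt;br /&gt;
 --dateflag is 1, 0, 0, 1: a date exactly 7 days after the minimum is still flagged 0&lt;br /&gt;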
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the merger/acquisition was announced strictly after the first investment (datefirstinv &amp;lt; announceddate), which is what the two exclude flags enforce.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to recover as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
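To see how the 9,645 rows split across the flags before recovering any of them, a breakdown by flag combination can help. A sketch:&lt;br /&gt;
 SELECT excludeflag1, excludeflag2, warningflag, COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 GROUP BY excludeflag1, excludeflag2, warningflag&lt;br /&gt;
 ORDER BY excludeflag1, excludeflag2, warningflag;&lt;br /&gt;
The row with all three flags at 0 corresponds to the good matches counted above.&lt;br /&gt;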
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. That happens when two matches for the same portco key share the minimum announceddate. To find those keys use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
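The rescue steps above can be tried out on a toy table before running them against vcdb2. The sketch below is only an illustration: it uses an in-memory SQLite database instead of PostgreSQL, the table names matcher, mindates and core are shortened stand-ins, and every row is invented. It keeps the zero-flag rows, adds back the earliest-dated warning row per portco key, and compares the total row count with the distinct key count.&lt;br /&gt;

```python
import sqlite3

# Invented sample rows standing in for matcherportcomas (not real SDC data).
# Columns: coname, statecode, datefirstinv, targetname, announceddate,
#          excludeflag1, excludeflag2, warningflag
ROWS = [
    ("Alpha Inc", "TX", "2001-01-05", "Alpha Inc", "2005-03-01", 0, 0, 0),
    ("Beta LLC", "CA", "2002-06-10", "Beta LLC", "2008-09-15", 0, 0, 1),
    ("Beta LLC", "CA", "2002-06-10", "Beta Holdings", "2010-02-01", 0, 0, 1),
    ("Gamma Co", "NY", "2003-04-20", "Gamma Co", "2002-01-01", 1, 0, 0),
]


def build_core(rows):
    """Build the core table: good matches plus earliest-dated warning matches."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE matcher (coname TEXT, statecode TEXT, datefirstinv TEXT,"
        " targetname TEXT, announceddate TEXT,"
        " excludeflag1 INT, excludeflag2 INT, warningflag INT)"
    )
    con.executemany("INSERT INTO matcher VALUES (?,?,?,?,?,?,?,?)", rows)
    # Earliest announceddate per portco key among the warning rows.
    con.execute(
        "CREATE TABLE mindates AS"
        " SELECT coname, statecode, datefirstinv, MIN(announceddate) AS mindate"
        " FROM matcher"
        " WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1"
        " GROUP BY coname, statecode, datefirstinv"
    )
    # Good matches, plus the warning rows that carry the earliest date.
    con.execute(
        "CREATE TABLE core AS"
        " SELECT * FROM matcher"
        " WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0"
        " UNION ALL"
        " SELECT m.* FROM matcher AS m JOIN mindates AS mi"
        " ON m.coname = mi.coname AND m.statecode = mi.statecode"
        " AND m.datefirstinv = mi.datefirstinv AND m.announceddate = mi.mindate"
        " WHERE m.excludeflag1 = 0 AND m.excludeflag2 = 0 AND m.warningflag = 1"
    )
    return con


def key_counts(con):
    """Return (total rows, distinct portco keys) for the core table."""
    total = con.execute("SELECT COUNT(*) FROM core").fetchone()[0]
    distinct = con.execute(
        "SELECT COUNT(*) FROM"
        " (SELECT DISTINCT coname, statecode, datefirstinv FROM core)"
    ).fetchone()[0]
    return total, distinct
```

With the sample rows, key_counts(build_core(ROWS)) returns (2, 2), so the core table is 1:1 on the portco key. If a second warning row for Beta LLC shared the 2008-09-15 date, the two numbers would differ, which is the same symptom as the 366 vs 364 mismatch above.&lt;br /&gt;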
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. They should be 1:1 on themselves, meaning each key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined row counts of the matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
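The flag logic and the count check in this section can be sketched outside the database. The two helpers below are hypothetical names written for illustration, not part of ProcessData2.sql: exit_flags mirrors the masterdate, maskeyvalid and ipokeyvalid CASE expressions for a single row, and reconcile rearranges the count comparison so that its result is the number of portcos that kept both keys because the two dates were equal.&lt;br /&gt;

```python
from datetime import date


def exit_flags(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid) for one portco row.

    None plays the role of SQL NULL. The earliest non-null date wins,
    and a flag is 1 only when its own date equals that earliest date.
    """
    candidates = [d for d in (announceddate, ipoissuedate) if d is not None]
    masterdate = min(candidates) if candidates else None
    maskeyvalid = int(announceddate is not None and announceddate == masterdate)
    ipokeyvalid = int(ipoissuedate is not None and ipoissuedate == masterdate)
    return masterdate, maskeyvalid, ipokeyvalid


def reconcile(valid_mas, valid_ipo, both, matched_mas, matched_ipo):
    """Return the number of portcos that kept both keys (equal dates).

    A result of 0 means the counts reconcile exactly.
    """
    return valid_mas + valid_ipo + both - matched_mas - matched_ipo
```

For example, exit_flags(date(2010, 1, 1), date(2008, 5, 5)) returns (date(2008, 5, 5), 0, 1), so only the ipokey is kept. Plugging in the counts from this page, reconcile(8610, 2312, 83, 8655, 2350) returns 0, which suggests none of the 83 overlapping portcos had an ipo and an ma on the same date.&lt;br /&gt;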
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that marks rows where the issuedate &amp;lt; datefirstinv and one that marks rows where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude the row from your table. If both flags are 0 (i.e. the issuedate &amp;gt; datefirstinv) then you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 rows vs 43651 distinct keys), spread across the 8 keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
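The row counts and the duplicate-key listing measure two different things, which a few lines of Python can make concrete. This is a minimal sketch with a hypothetical helper, duplicate_summary, fed with the (city, coname, startyear) keys and counts from the listing above plus one invented unique key. It reports how many keys are duplicated and how many surplus rows they account for.&lt;br /&gt;

```python
from collections import Counter

# (city, coname, startyear, count) copied from the duplicate-key listing.
GEO1_DUPLICATES = [
    ("New York", "New York Digital Health LLC", 2015, 2),
    ("Portland", "Undisclosed Company", 2016, 2),
    ("Hauppauge", "Mdeverywhere Inc", 1999, 2),
    ("North Mankato", "Angie's Artisan Treats LLC", 2011, 2),
    ("Cincinnati", "Undisclosed Company", 2016, 4),
    ("New York", "Undisclosed Company", 2015, 2),
    ("San Francisco", "Undisclosed Company", 2016, 2),
    ("San Francisco", "Undisclosed Company", 2015, 3),
]


def duplicate_summary(keys):
    """Return (duplicated keys with their counts, number of surplus rows)."""
    counts = Counter(keys)
    dupes = {key: n for key, n in counts.items() if n != 1}
    surplus = sum(n - 1 for n in dupes.values())
    return dupes, surplus


# Expand the listing back into one key per row.
keys = [
    (city, coname, year)
    for city, coname, year, n in GEO1_DUPLICATES
    for _ in range(n)
]
keys.append(("Houston", "Unique Co", 2010))  # invented unique key for contrast

dupes, surplus = duplicate_summary(keys)
named = [k for k in keys if "Undisc" not in k[1]]
named_dupes, named_surplus = duplicate_summary(named)
```

Here len(dupes) is 8 and surplus is 11, matching the gap between 43662 rows and 43651 distinct keys. After dropping the Undisclosed Company keys, len(named_dupes) and named_surplus are both 3, so three single-row deletes are enough to make the remaining keys unique.&lt;br /&gt;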
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through the [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv from the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key. Some of the geo coordinates are incorrect. This was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
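The deaddate rule can be checked row by row with a small helper. The sketch below is an illustration only, and dead_date and add_years are hypothetical names. It applies the same precedence as the CASE expression above: the ipo issuedate if there is one, otherwise the ma announceddate, otherwise datelastinv plus 5 years. The Feb 29 handling is an assumption chosen to resemble how PostgreSQL interval arithmetic clamps to the end of the month.&lt;br /&gt;

```python
from datetime import date


def add_years(d, years):
    """Add whole years to a date, clamping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)


def dead_date(datelastinv, issuedate=None, announceddate=None):
    """Return the deaddate for one company; None plays the role of SQL NULL."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    if datelastinv is None:
        return None
    return add_years(datelastinv, 5)
```

For a company with no exit and a last investment on 2010-03-15, dead_date(date(2010, 3, 15)) returns date(2015, 3, 15). Passing an issuedate or announceddate returns that exit date instead.&lt;br /&gt;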
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
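The tempbase and alivecount steps expand each company into one row per year it was alive and then count companies per city and year. A minimal sketch of that idea follows, assuming each company is a (coname, city, statecode, aliveyear, deadyear) tuple. The function name alive_counts and the sample companies are invented, and unlike the SQL it does not restrict years to those present in the allyears table.&lt;br /&gt;

```python
from collections import Counter


def alive_counts(companies):
    """Count companies alive per (city, statecode, year).

    companies: iterable of (coname, city, statecode, aliveyear, deadyear),
    with both years inclusive.
    """
    # One distinct (year, coname, city, statecode) entry per alive year,
    # like the SELECT DISTINCT that builds tempbase.
    seen = set()
    for coname, city, statecode, aliveyear, deadyear in companies:
        for year in range(aliveyear, deadyear + 1):
            seen.add((year, coname, city, statecode))
    # Count companies per city, state and year, like alivecount.
    counts = Counter()
    for year, coname, city, statecode in seen:
        counts[(city, statecode, year)] += 1
    return counts


# Invented sample companies for illustration.
SAMPLE = [
    ("A Co", "Houston", "TX", 2000, 2002),
    ("B Co", "Houston", "TX", 2001, 2001),
]
numalive = alive_counts(SAMPLE)
```

With the sample data, numalive[("Houston", "TX", 2001)] is 2 while 2000 and 2002 each have 1, because B Co is only alive in 2001.&lt;br /&gt;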
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
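The count is one more than roundcore (143000), so one of the joins has duplicated a row. Find the duplicated (coname, rounddate) keys and delete them:&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;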
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
The roundplus table is used to build the two round-level output tables below.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early, or later stage financing. The growthflag is 1 if any of the seed, early, or later flags is 1. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that these entries can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout', 'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations outside the US, such as India and Eastern Europe, even though the company addresses themselves exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.&lt;br /&gt;
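&lt;br /&gt;
As a rough sketch of the box test described above (not the author's actual flagging step, which appears to have been done outside the db and imported from geodirtydataflags.txt), the flags could also be computed in SQL. The flag names are borrowed from the geodirtydatawithflags table below; identifying Hawaii and Puerto Rico by statecode ('HI', 'PR') is an assumption.&lt;br /&gt;
 --sketch only: flag coordinates outside the continental US box&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude NOT BETWEEN -125 AND -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude NOT BETWEEN 24 AND 50 THEN 1::int ELSE 0::int END AS latdirtyflag,&lt;br /&gt;
 --assumption: HI and PR rows are recognised by statecode rather than by their own boxes&lt;br /&gt;
 CASE WHEN statecode = 'HI' THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 CASE WHEN statecode = 'PR' THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport&lt;br /&gt;
 WHERE latitude IS NOT NULL AND longitude IS NOT NULL;&lt;br /&gt;
A row would then be treated as dirty when longdirtyflag or latdirtyflag is 1 and both hawaiiflag and prflag are 0.&lt;br /&gt;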
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
There are still geo errors in the db: some addresses within the US have incorrect geo-coordinates. To fix this problem we will simply look up all the addresses in the db using the Geocode.py script. We also need to pull a company-level file from SDC, because in the round-level file the normalizer copies the addresses down or leaves them null. Modify your round ssh sdc script to remove the round dates, so that each company is assigned exactly one line. There will be no normalization errors this way.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag for them and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
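Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate. Note that firmbaseinclude, despite its name, holds the greater foundingdate for each duplicated firmname; those are the rows dropped by the anti-join that builds firmbasecore.&lt;br /&gt;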
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. However, we are going to use fundname alone as the key because it makes join operations easier later on.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and, for each one, include only the row that has both the MIN(closedate) and the MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
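&lt;br /&gt;
Note that the join above returns 44 rows for 47 duplicated fundnames, so at least three fundnames have no row that carries both minimums. A possible alternative, sketched here with a hypothetical table name (fundbasecoresketch) and under the assumption that closedate should take priority over lastinvdate, is to rank the rows within each fundname and keep the first one, which always leaves exactly one row per fundname:&lt;br /&gt;
 --sketch only: one row per fundname, earliest closedate first, then earliest lastinvdate&lt;br /&gt;
 CREATE TABLE fundbasecoresketch AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
  SELECT f.*, ROW_NUMBER() OVER (PARTITION BY fundname ORDER BY closedate, lastinvdate) AS rn&lt;br /&gt;
  FROM fundbase1 AS f&lt;br /&gt;
  WHERE undisclosedflag = 0&lt;br /&gt;
 ) AS ranked&lt;br /&gt;
 WHERE rn = 1;&lt;br /&gt;
In PostgreSQL an ascending ORDER BY sorts NULLs last, so a row with a missing closedate is only kept when no dated row exists for that fundname. The extra rn column can be dropped afterwards.&lt;br /&gt;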
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Export both tables and run the two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport into vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
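A minimal sketch of how those firms could be listed, assuming the same firmname join as above, is an anti-join from the funds side:&lt;br /&gt;
 --sketch only: fund firmnames with no match in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;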
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
The portcoexitmaster table contains the portcokey (coname, statecode, datefirstinv) along with an ipoflag, a maflag, an exitflag and an exit value in millions (exitvaluem). It is built off the companybaseipomasmaster table, so be sure you have built that first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys:&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
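Compared with the 14133 firms in firmbasecore, the 27047 funds in fundbasecore and the 44740 portcokeys in companybasecore, you can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;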
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19736</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19736"/>
		<updated>2017-08-03T22:13:31Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining funds with roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.&lt;br /&gt;
Create companybase from roundbase.&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by adding flags for Undisclosed Company and US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates from roundbase, either manually or with an update command. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run the Matcher on ipos and companybase, then on mas and companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the row counts copied into your new tables match the counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build off roundbase will inherit the dirty keys, and they will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and good to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means a 1:1 relationship between keys and entries: every entry yields a unique key, and every (coname, statecode, datefirstinv) key identifies exactly one entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so a key can match more than one entry in the table.&lt;br /&gt;
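As a sketch, both counts can also be fetched in one query by using PostgreSQL's row constructor inside COUNT(DISTINCT ...). The column aliases numrows and numkeys are made up for illustration:&lt;br /&gt;
 --sketch only: the key is valid when numrows equals numkeys&lt;br /&gt;
 SELECT COUNT(*) AS numrows,&lt;br /&gt;
 COUNT(DISTINCT (coname, statecode, datefirstinv)) AS numkeys&lt;br /&gt;
 FROM companybase;&lt;br /&gt;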
Some of the entries in the companybase table have undisclosed company names or belong to companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the two flags (on nationcode and undisclosed company) that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between a (coname, statecode, datefirstinv) key and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on.&lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicated keys have different principalamt values. For each key, let's keep the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates: when several rows for a key tie on the MAX principalamt, they all survive the join. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
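The same comparison of total rows against distinct keys is used for every table on this page. A minimal sketch of that check in Python, assuming an in-memory SQLite database as a stand-in for vcdb2; the helper name key_is_unique and the sample rows are made up for illustration:&lt;br /&gt;

```python
import sqlite3

def key_is_unique(conn, table, key_cols):
    """Return (total_rows, distinct_keys, is_unique) for a candidate key.

    Table and column names are interpolated directly, so only pass trusted names.
    """
    cols = ", ".join(key_cols)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table}) AS a"
    ).fetchone()[0]
    return total, distinct, total == distinct

# Hypothetical sample data, not taken from the real ipos table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ipocore (issuer TEXT, issuedate TEXT, statecode TEXT)")
conn.executemany(
    "INSERT INTO ipocore VALUES (?, ?, ?)",
    [
        ("Alpha Inc", "1999-01-05", "CA"),
        ("Beta Corp", "2001-06-30", "NY"),
        ("Alpha Inc", "1999-01-05", "CA"),  # duplicate key
    ],
)
result = key_is_unique(conn, "ipocore", ["issuer", "issuedate", "statecode"])
print(result)
```

A table is clean for a given key only when the third value is True.&lt;br /&gt;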
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
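 --build masinclude from mas (not the mas1 copy) so the ids are the same ones joined on below&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;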
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches its total row count, so the mascore table is clean.&lt;br /&gt;
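The MIN id approach can be tried on a toy table first. This is a sketch only, assuming an in-memory SQLite database in place of vcdb2 and an INTEGER PRIMARY KEY in place of the SERIAL column; the rows are hypothetical:&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id stands in for the SERIAL column added to mas; rows are hypothetical.
conn.execute(
    "CREATE TABLE mas (id INTEGER PRIMARY KEY, targetname TEXT, "
    "targetstatecode TEXT, announceddate TEXT)"
)
conn.executemany(
    "INSERT INTO mas (targetname, targetstatecode, announceddate) VALUES (?, ?, ?)",
    [
        ("Alpha Inc", "CA", "2005-03-01"),
        ("Alpha Inc", "CA", "2005-03-01"),  # same key, later id
        ("Beta Corp", "NY", "2006-07-15"),
    ],
)
# One id per key: the smallest one.
conn.execute(
    "CREATE TABLE masinclude AS "
    "SELECT targetname, targetstatecode, announceddate, MIN(id) AS id "
    "FROM mas GROUP BY targetname, targetstatecode, announceddate"
)
# Keep only the rows whose id survived.
conn.execute(
    "CREATE TABLE mascore AS "
    "SELECT mas.* FROM mas INNER JOIN masinclude ON mas.id = masinclude.id"
)
kept_ids = [row[0] for row in conn.execute("SELECT id FROM mascore ORDER BY id")]
print(kept_ids)
```

Note that the ids used in the GROUP BY and in the final join come from the same table here, which is what makes the join pick the intended rows.&lt;br /&gt;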
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or when it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
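The boundary behaviour of the dateflag is easy to get wrong, so here is a small sketch of the same CASE logic in plain Python. The function name dateflag and the sample dates are illustrative assumptions, not part of the database:&lt;br /&gt;

```python
from datetime import date, timedelta

def dateflag(announceddate, minanndate):
    """Mirror the CASE expression: 1 keeps the key, 0 drops it."""
    if announceddate == minanndate:
        return 1
    if announceddate - timedelta(days=7) > minanndate:
        return 1
    return 0

first = date(2005, 3, 1)  # hypothetical minimum announceddate for one target
flags = [
    dateflag(first, first),             # the minimum itself
    dateflag(date(2005, 3, 4), first),  # 3 days later
    dateflag(date(2005, 3, 8), first),  # exactly 7 days later
    dateflag(date(2005, 3, 9), first),  # 8 days later
]
print(flags)
```

Because the comparison is strict, a key announced exactly 7 days after the minimum is still dropped.&lt;br /&gt;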
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag that is 1 if the announceddate &amp;lt; datefirstinv and another exclude flag that is 1 if the datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
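The three flags were added by hand in Excel. As a rough sketch of the same rules, assuming the dates have already been parsed, they could be computed like this (matcher_flags and the example rows are hypothetical):&lt;br /&gt;

```python
from datetime import date

def matcher_flags(warning, datefirstinv, announceddate):
    """Return (excludeflag1, excludeflag2, warningflag) for one matched row."""
    excludeflag1 = int(datefirstinv > announceddate)   # exit before first investment
    excludeflag2 = int(datefirstinv == announceddate)  # exit on the same day
    warningflag = int(warning == "Hall-Warning:Multiple")
    return excludeflag1, excludeflag2, warningflag

# Hypothetical rows: (warning, datefirstinv, announceddate)
examples = [
    matcher_flags("", date(2000, 1, 1), date(2004, 5, 1)),
    matcher_flags("Hall-Warning:Multiple", date(2000, 1, 1), date(2004, 5, 1)),
    matcher_flags("", date(2000, 1, 1), date(1999, 1, 1)),
    matcher_flags("", date(2000, 1, 1), date(2000, 1, 1)),
]
print(examples)
```

Only rows where all three flags are 0 count as good matches.&lt;br /&gt;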
You've imported 9,645 matches into your matcher table in vcdb2. The query below gives the number of &amp;quot;good&amp;quot; matches: those that do not contain warnings, where the announceddate &amp;gt; datefirstinv for a merger/acquisition, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries, use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
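The rescue step (keep the earliest match for each warned portco key, then UNION ALL with the clean matches) can be sketched end to end on a toy table. This assumes an in-memory SQLite database and a simplified one-column key (coname) instead of the full (file1coname, file1statecode, file1datefirstinv) key; all rows are hypothetical:&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matcher (coname TEXT, targetname TEXT, "
    "announceddate TEXT, warningflag INTEGER)"
)
conn.executemany(
    "INSERT INTO matcher VALUES (?, ?, ?, ?)",
    [
        ("Alpha Inc", "Alpha Inc", "2005-03-01", 0),         # clean match
        ("Beta Corp", "Beta Corp", "2006-07-15", 1),         # multiple match, earliest
        ("Beta Corp", "Beta Corporation", "2009-01-20", 1),  # multiple match, later
    ],
)
rows = conn.execute(
    """
    SELECT coname, targetname, announceddate FROM matcher WHERE warningflag = 0
    UNION ALL
    SELECT m.coname, m.targetname, m.announceddate
    FROM matcher AS m
    INNER JOIN (
        SELECT coname, MIN(announceddate) AS mindate
        FROM matcher WHERE warningflag = 1 GROUP BY coname
    ) AS mi ON m.coname = mi.coname AND m.announceddate = mi.mindate
    WHERE m.warningflag = 1
    ORDER BY coname
    """
).fetchall()
print(rows)
```

If two warned matches share the same minimum date, both survive the join, which is the situation that required the manual DELETE statements above.&lt;br /&gt;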
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, and ipocore are clean core tables. They should be 1:1 on themselves, meaning each key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]], and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Finally we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from the companybase key (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
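The masterdate and validity flags reduce to a small rule per row. A sketch of that rule in Python, where None plays the role of NULL; exit_flags and the sample dates are illustrative assumptions:&lt;br /&gt;

```python
from datetime import date

def exit_flags(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid); None plays the role of NULL."""
    dates = [d for d in (announceddate, ipoissuedate) if d is not None]
    masterdate = min(dates) if dates else None
    maskeyvalid = int(announceddate is not None and announceddate == masterdate)
    ipokeyvalid = int(ipoissuedate is not None and ipoissuedate == masterdate)
    return masterdate, maskeyvalid, ipokeyvalid

both = exit_flags(date(2008, 5, 1), date(2006, 2, 10))  # ipo came first
ma_only = exit_flags(date(2008, 5, 1), None)
no_exit = exit_flags(None, None)
print(both, ma_only, no_exit)
```

If the two dates are equal, both flags come out as 1, so a tie would need a manual decision.&lt;br /&gt;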
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the sum to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 8655 + 2350 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
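The reasoning behind this check is that each of the 83 portco keys with both an ipo key and a mas key loses exactly one of the two, so the kept keys plus 83 should equal all matched keys. Spelled out with the counts reported on this page (the variable names are only labels for this sketch):&lt;br /&gt;

```python
# Counts copied from the queries on this page.
mas_valid = 8610    # rows in companybasekeymaskeycore
ipo_valid = 2312    # rows in companybasekeyipokeycore
both = 83           # portco keys with both an ipo key and a mas key
matcher_mas = 8655  # rows in matcherportcomascore
matcher_ipo = 2350  # rows in matcherportcoipocore

kept_plus_dropped = mas_valid + ipo_valid + both
matched_total = matcher_mas + matcher_ipo
print(kept_plus_dropped, matched_total)
```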
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the date and amount of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], only the exit deal with the minimum date is kept for each company, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add one exclude flag that is 1 where the issuedate &amp;lt; datefirstinv and another that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table; if both are 0 (i.e. the issuedate &amp;gt; datefirstinv), keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 vs 43651), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
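The jump in row count is the usual fan-out from a LEFT JOIN through a lookup table that is not 1:1. A toy reproduction, assuming an in-memory SQLite database and hypothetical company names rather than the real matcher output:&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE portco (coname TEXT)")
conn.execute("CREATE TABLE matcher (portcoconame TEXT, geoconame TEXT)")
conn.executemany("INSERT INTO portco VALUES (?)", [("Alpha Inc",), ("Beta Corp",)])
conn.executemany(
    "INSERT INTO matcher VALUES (?, ?)",
    [
        ("Alpha Inc", "Alpha Inc"),
        ("Beta Corp", "Beta Corp"),
        ("Beta Corp", "Beta Corporation"),  # second match for the same portco
    ],
)
base_rows = conn.execute("SELECT COUNT(*) FROM portco").fetchone()[0]
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM portco AS c "
    "LEFT JOIN matcher AS m ON m.portcoconame = c.coname"
).fetchone()[0]
print(base_rows, joined_rows)
```

Every extra row in the joined count corresponds to a portco key that the matcher paired with more than one geo key.&lt;br /&gt;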
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
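Because this is a LEFT JOIN, companies without a match in geoimport keep NULL coordinates. A quick coverage check, offered as a sketch (the column aliases are illustrative):&lt;br /&gt;
 --sketch: COUNT(latitude) skips NULLs, so the difference is the number of companies without coordinates&lt;br /&gt;
 SELECT COUNT(*) AS total, COUNT(latitude) AS withgeo, COUNT(*) - COUNT(latitude) AS missinggeo&lt;br /&gt;
 FROM companybasegeomaster;&lt;br /&gt;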
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: the ipo issuedate if there is one, otherwise the ma announceddate, and if there is no exit at all, datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
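As a sanity check on the CASE logic, the following sketch counts how many companies fall into each branch. The exittype labels are made up for illustration and follow the same priority as the query above (ipo first, then ma).&lt;br /&gt;
 --sketch: count companies by the branch that determined their deaddate&lt;br /&gt;
 SELECT CASE&lt;br /&gt;
  WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN 'ma'&lt;br /&gt;
  ELSE 'noexit' END AS exittype, COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY 1;&lt;br /&gt;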
Run the queries below to build a master table that has dead and alive flags, and then the count of companies alive in each year by city and state. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
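The tempbase query above joins against a table called allyears, which is not defined in this section and has to exist before tempbase is built. A minimal sketch, assuming it is a one-column table of calendar years; the 1980 to 2017 range is a placeholder and is not taken from the source data, so adjust it to cover your datefirstinv and deaddate values:&lt;br /&gt;
 --sketch: one row per calendar year; the range is an assumption&lt;br /&gt;
 DROP TABLE IF EXISTS allyears;&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;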
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two round-level output tables below.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
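To confirm that coname, rounddate is now a valid key on roundcore, the same two-count check used elsewhere on this page can be applied. This is a sketch; the two counts should match:&lt;br /&gt;
 --sketch: total rows versus distinct keys&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;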
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that they can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
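Any stage3 value that is not listed in one of the CASE expressions above ends up with every flag set to 0. The following sketch lists such values so they can be reviewed (numrounds is an illustrative alias; rows with a NULL stage3 will also appear here):&lt;br /&gt;
 --sketch: stage3 values not covered by growthflag, transactionflag or excludeflag&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;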
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses are available. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box, then add back the points that are located in Hawaii and Puerto Rico. Finally, import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
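The flags above were imported from a file, but the bounding-box test itself could also be expressed in SQL. The following is only a sketch of that idea and not the method actually used here. It assumes that statecode values 'HI' and 'PR' identify Hawaii and Puerto Rico, which this page does not confirm, and it reuses the flag names from geodirtydatawithflags for readability.&lt;br /&gt;
 --sketch: flag coordinates outside the continental US box (longitude -125 to -66, latitude 24 to 50)&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
 CASE WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude &amp;lt; 24 OR latitude &amp;gt; 50 THEN 1::int ELSE 0::int END AS latdirtyflag,&lt;br /&gt;
 CASE WHEN statecode = 'HI' THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 CASE WHEN statecode = 'PR' THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE latitude IS NOT NULL AND longitude IS NOT NULL;&lt;br /&gt;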
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
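Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the entry with the lesser foundingdate. Note that firmbaseinclude, despite its name, collects the entry with the greater foundingdate for each duplicated firmname, and the final query drops those entries.&lt;br /&gt;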
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
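The plan is to take every duplicated fundname and keep only the row that has both the MIN(closedate) and the MIN(lastinvdate) for that name in the fundbasecore table. Note from the counts below that the join on both minimums returns only 44 rows for the 47 duplicated names, so at least 3 names drop out of fundbasecore entirely.&lt;br /&gt;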
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
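Two follow-up checks may be useful here, both offered as sketches. The first repeats the two-count key check on the final table. The second lists the duplicated fundnames that have no row in fundinclude2, that is, the names that were dropped entirely.&lt;br /&gt;
 --sketch: both counts should match if fundname is a valid key&lt;br /&gt;
 SELECT COUNT(*) FROM fundbasecore;&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
&lt;br /&gt;
 --sketch: duplicated fundnames with no row that holds both minimums&lt;br /&gt;
 SELECT e.fundname&lt;br /&gt;
 FROM fundnameexclude AS e&lt;br /&gt;
 LEFT JOIN fundinclude2 AS i ON i.fundname = e.fundname&lt;br /&gt;
 WHERE i.fundname IS NULL;&lt;br /&gt;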
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the firm keys and the fund keys, including the firmname column from the fundbasecore table. Run these two files through the Matcher, then manually flag the multiple matches (there are only about 50 of them) and reimport into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
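One way to list those firms without doing the full join is an anti-join, sketched here with the same tables and the same 'Undisclosed Firm' filter as the test join above:&lt;br /&gt;
 --sketch: firm names that appear in fundbasecore but not in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;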
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Creating portcoexitmaster==&lt;br /&gt;
Portcoexitmaster contains the portcokey with an exitflag, ipoflag and maflag and an exit value. It is built off the companybaseipomasmaster table so be sure you've built this first.&lt;br /&gt;
 CREATE TABLE portcoexitbuild AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, ipoissuedate, masannounceddate, ipoprincipalamtk, mastransactionamtk&lt;br /&gt;
 FROM companybaseipomasmaster;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE portcoexitmaster;&lt;br /&gt;
 CREATE TABLE portcoexitmaster AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN 1::int ELSE 0::int END AS ipoflag,&lt;br /&gt;
 CASE WHEN masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS maflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL OR masannounceddate IS NOT NULL THEN 1::int ELSE 0::int END AS exitflag,&lt;br /&gt;
 CASE WHEN ipoissuedate IS NOT NULL THEN ipoprincipalamtk::numeric::float8/1000 ELSE mastransactionamtk::numeric::float8/1000 END AS &lt;br /&gt;
 exitvaluem&lt;br /&gt;
 FROM portcoexitbuild;&lt;br /&gt;
 \COPY portcoexitmaster TO 'portcoexitmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
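A quick summary of the flags can help verify the table, shown here as a sketch with illustrative aliases. The numboth column counts companies that have both an ipo and an ma; for those, the CASE above takes exitvaluem from the ipo amount.&lt;br /&gt;
 --sketch: flag totals; ipoflag * maflag is 1 only when both flags are 1&lt;br /&gt;
 SELECT SUM(ipoflag) AS numipos, SUM(maflag) AS nummas, SUM(exitflag) AS numexits, SUM(ipoflag * maflag) AS numboth&lt;br /&gt;
 FROM portcoexitmaster;&lt;br /&gt;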
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys:&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that many keys do not exist in the other datasets: only 42,093 of the 44,740 portco keys in companybasecore survive the join.&lt;br /&gt;
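To see which portco keys are missing from the joined table, an anti-join against companybasecore can be used. This is a sketch and not part of the original build:&lt;br /&gt;
 --sketch: portco keys in companybasecore with no row in the joined table&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) AS g&lt;br /&gt;
 ON g.coname = c.coname AND g.statecode = c.statecode AND g.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE g.coname IS NULL;&lt;br /&gt;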
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19734</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19734"/>
		<updated>2017-08-03T21:07:02Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining funds, firms, roundline with companybasecore */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
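As an alternative to exporting the whole table, the two rows can be pulled up directly in psql. This is only a sketch that reuses the key values from the output above; typing \x first switches psql to expanded display, which makes wide rows easier to compare:&lt;br /&gt;
 --show both rows that share the duplicate key&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;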
Manually delete this record from the roundbase table, which is the table companybasecore is built from, using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build the companybasecore table. The WHERE clause takes the place of the 2 flags (nationcode and undisclosed company) that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between the (coname, statecode, datefirstinv) key and an entry in the table; the two queries at the end verify this. We use core tables to run joins later on.&lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. Isolating the duplicates takes some experimentation. I started by finding the duplicates based on our key (issuer, issuedate, statecode), then looked through them in TextPad/Excel for ways to filter them. These duplicates account for 1888 rows in the ipos table, so rather than excluding all of them we would like to find a filter that keeps as many records as possible while still leaving distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamts. For each key, let's keep the record with the MAX principalamt and throw out the records with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
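The DELETE statements above can also be generated from the ipoduplicates2 table instead of being typed by hand. A minimal sketch, assuming PostgreSQL 9.1 or later (for format()); review the generated statements before running them:&lt;br /&gt;
 --one DELETE per duplicated issuer/statecode pair, with values quoted as SQL literals&lt;br /&gt;
 SELECT DISTINCT format('DELETE FROM ipoinclude WHERE issuer = %L AND statecode = %L;', issuer, statecode)&lt;br /&gt;
 FROM ipoduplicates2&lt;br /&gt;
 ORDER BY 1;&lt;br /&gt;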
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before matching companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher on them will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 7 days after the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is 1 to 7 days after the minimum announceddate for the targetname, targetstatecode pair, the dateflag is 0.&lt;br /&gt;
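The boundary behaviour of the flag can be checked with a few made-up dates. The dates below are hypothetical, with 2010-01-01 standing in for the minimum announceddate:&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d)&lt;br /&gt;
 ORDER BY d;&lt;br /&gt;
 --dateflag: 1, 0, 0, 1&lt;br /&gt;
A date exactly 7 days after the minimum (2010-01-08 here) still gets a 0 because the comparison is strict.&lt;br /&gt;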
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where the announceddate &amp;lt; datefirstinv and another exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
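Because the flags are added by hand in Excel, it is worth confirming after the import that they agree with the dates. The query below is a sketch that assumes excludeflag1 marks announceddate &amp;lt; datefirstinv and excludeflag2 marks equal dates (the text above does not say which flag is which); a count of 0 means every flag is consistent:&lt;br /&gt;
 --count rows whose Excel flags disagree with the imported dates or warning text&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 IS DISTINCT FROM (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag IS DISTINCT FROM (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;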
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that have no warnings and where the merger/acquisition announceddate is later than the datefirstinv (announceddate &amp;gt; datefirstinv), which also rules out equal dates.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables, meaning that 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies have ipos as well as mergers/acquisitions, or the data may have been miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will inflate our row counts every time we join on these duplicate keys. We want each portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match both an ipokey and a maskey. We'll write a query in SQL to take whichever of the ipokey or maskey has the earlier date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
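To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined row counts of your matcherportcomascore and matcherportcoipocore tables. Each of the 83 portcos matched to both an ipokey and a maskey keeps only its earlier exit, so it appears in just one of the two key tables. In my case I get 8610 + 2312 + 83 = 8655 + 2350 = 11005.&lt;br /&gt;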
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
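The same comparison can be done in a single query rather than by hand. This is a sketch that uses only the tables built above; the column aliases keytablesplusboth and matchercoretotal are made up for illustration, and the two values should be equal:&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keytablesplusboth,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchercoretotal;&lt;br /&gt;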
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
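One further check, sketched here using only the table built above, is that the master table still has one row per portco key. The distinct key count should equal the table's row count (44740 in this build):&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;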
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, you will want to exclude the row from your table. If both flags are 0 (i.e. the issuedate &amp;gt; datefirstinv), you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and ready for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
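To see where the extra rows come from, it may help to list the portco keys that the Matcher paired with more than one geo key, since each of those is one likely source of fan-out in the join. A minimal diagnostic sketch using only the matcherportcogeo columns defined above (nummatches is just an illustrative alias):&lt;br /&gt;
 --portco keys with more than one geo match&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;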
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
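A minimal sketch of such an export, assuming companybasecore carries the addr1, addr2, city, statecode and zip columns used elsewhere on this page. The table and file name addressesforgeocode are hypothetical, and the exact input layout that Geocode.py expects is not shown here, so adjust the column order as needed.&lt;br /&gt;
 --export addresses together with the portco key (coname, statecode, datefirstinv)&lt;br /&gt;
 CREATE TABLE addressesforgeocode AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 WHERE addr1 IS NOT NULL;&lt;br /&gt;
 \COPY addressesforgeocode TO 'addressesforgeocode.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;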
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
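First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: if there is an IPO the deaddate is the issuedate, otherwise if there is a merger or acquisition it is the announceddate, and if there is no exit the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;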
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
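To check the result, you could count how many companies fall into each branch of the CASE statement. A small sketch (exittype is an illustrative alias, not a column in the database):&lt;br /&gt;
 --how many deaddates come from an IPO, a merger or acquisition, or the 5 year rule&lt;br /&gt;
 SELECT CASE&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN 'ma'&lt;br /&gt;
 ELSE 'noexit'&lt;br /&gt;
 END AS exittype, COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY exittype;&lt;br /&gt;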
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in and thus should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
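The bounding-box flags can also be computed directly in the database rather than in a spreadsheet. A minimal sketch using the boundaries given above; the flag names follow geodirtydatawithflags, and the assumption that Hawaii and Puerto Rico carry the statecodes 'HI' and 'PR' should be checked against your data.&lt;br /&gt;
 --flag coordinates outside the continental US box, leaving out Hawaii and Puerto Rico&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude &amp;lt; -125 OR longitude &amp;gt; -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude &amp;lt; 24 OR latitude &amp;gt; 50 THEN 1::int ELSE 0::int END AS latdirtyflag&lt;br /&gt;
 FROM geoimport&lt;br /&gt;
 WHERE statecode NOT IN ('HI', 'PR');&lt;br /&gt;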
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
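Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate: firmbaseinclude picks up the greater (MAX) foundingdate for each duplicated firmname, and the final query drops those rows from the core table.&lt;br /&gt;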
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
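To confirm that fundname now works as a key, compare the distinct count with the row count, and list any fundnames that were lost because no single row held both minimum dates. A sketch using only the tables built above:&lt;br /&gt;
 --should equal the number of rows in fundbasecore&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;
 --duplicated fundnames that did not make it into fundinclude2&lt;br /&gt;
 SELECT e.fundname&lt;br /&gt;
 FROM fundnameexclude AS e&lt;br /&gt;
 LEFT JOIN fundinclude2 AS i ON i.fundname = e.fundname&lt;br /&gt;
 WHERE i.fundname IS NULL;&lt;br /&gt;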
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport to vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
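One way to list those firms is an anti-join from the funds table to the firms table. A minimal sketch:&lt;br /&gt;
 --firmnames that appear in fundbasecore but not in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;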
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
Now count the distinct firms, funds and portcokeys:&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --8956&lt;br /&gt;
 &lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 --16907&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM fundbasefirmbaseroundlinegoodkeys) as gfoo;&lt;br /&gt;
 --42093&lt;br /&gt;
You can see that there are many keys that do not exist in the other datasets.&lt;br /&gt;
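To inspect the portco keys that dropped out, an anti-join against companybasecore could be used. A minimal sketch:&lt;br /&gt;
 --portco keys in companybasecore with no fund and firm match&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM fundbasefirmbaseroundlinegoodkeys AS g WHERE g.coname = c.coname AND g.statecode = c.statecode AND &lt;br /&gt;
 g.datefirstinv = c.datefirstinv);&lt;br /&gt;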
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19733</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19733"/>
		<updated>2017-08-03T21:00:54Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining funds with roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off the roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between keys and entries: given an entry you can build a unique key, and given a (coname, statecode, datefirstinv) key you can find exactly 1 entry in the companybase table that it corresponds to.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. The companybase table has 44,774 rows but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
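The key check used here (total rows versus distinct keys, then a GROUP BY to list the offenders) can be wrapped in a small helper. This is a minimal sketch, assuming SQLite via Python's sqlite3 module as a stand-in for psql; the function name key_report and the sample rows are hypothetical.&lt;br /&gt;

```python
import sqlite3

def key_report(conn, table, key_cols):
    """Return (total rows, distinct keys, duplicated keys with their counts).

    Table and column names are interpolated directly, so pass trusted names only.
    """
    cols = ", ".join(key_cols)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table}) AS a"
    ).fetchone()[0]
    dupes = conn.execute(
        f"SELECT {cols}, COUNT(*) FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1"
    ).fetchall()
    return total, distinct, dupes

# Toy stand-in for companybase; these rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companybase (coname TEXT, statecode TEXT, datefirstinv TEXT, url TEXT)"
)
conn.executemany(
    "INSERT INTO companybase VALUES (?, ?, ?, ?)",
    [
        ("Acme Inc", "TX", "2010-01-05", "www.acme.com"),
        ("Acme Inc", "TX", "2010-01-05", "www.acme.co"),  # same key, different url
        ("Beta LLC", "NY", "2012-03-09", "www.beta.com"),
    ],
)
total, distinct, dupes = key_report(
    conn, "companybase", ["coname", "statecode", "datefirstinv"]
)
print(total, distinct, dupes)
```

A table is clean on its key when total equals distinct and the list of duplicated keys is empty.&lt;br /&gt;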
Some rows in the companybase table have undisclosed company names or belong to companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in textpad for something unique about one of the New York Digital Health LLC entries that we can use to single it out for manual deletion. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamts. For each key, let's keep the MAX principalamt and throw out the rows with the same key and a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys, so that 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
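The reason the MAX(principalamt) filter alone was not enough is that rows with the same key and a tied principalamt all survive the join. The sketch below reproduces that behaviour on invented rows, assuming SQLite via Python's sqlite3 module as a stand-in for psql; the issuers and amounts are hypothetical.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipos (issuer TEXT, issuedate TEXT, statecode TEXT, principalamt REAL)"
)
conn.executemany(
    "INSERT INTO ipos VALUES (?, ?, ?, ?)",
    [
        ("Alpha Corp", "1999-05-01", "CA", 10.0),
        ("Alpha Corp", "1999-05-01", "CA", 25.0),  # duplicate key, larger amount wins
        ("Gamma Fund", "1998-02-02", "NY", 40.0),
        ("Gamma Fund", "1998-02-02", "NY", 40.0),  # duplicate key, tied amount
        ("Delta Inc", "2001-07-07", "TX", 5.0),
    ],
)

# One row per key, carrying the largest principalamt seen for that key
conn.execute(
    """CREATE TABLE ipoinclude AS
       SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt
       FROM ipos GROUP BY issuer, issuedate, statecode"""
)

JOIN_SQL = """SELECT ipos.* FROM ipos INNER JOIN ipoinclude
    ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate
    AND ipos.statecode = ipoinclude.statecode
    AND ipos.principalamt = ipoinclude.principalamt"""

# First pass: the tied Gamma Fund rows both match, so the key is still duplicated
first_pass = conn.execute(JOIN_SQL).fetchall()

# Drop the tied key from ipoinclude, as the manual DELETE statements do
conn.execute("DELETE FROM ipoinclude WHERE issuer = 'Gamma Fund' AND statecode = 'NY'")
second_pass = conn.execute(JOIN_SQL).fetchall()
distinct_keys = {(r[0], r[1], r[2]) for r in second_pass}
print(len(first_pass), len(second_pass), len(distinct_keys))
```

Note that deleting a tied key from ipoinclude drops every row for that key from the core table, so this approach trades a little coverage for guaranteed unique keys.&lt;br /&gt;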
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 --add the id to mas first so that the mas1 copy carries the same id values&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The mas distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
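The MIN(id) approach can be tried on a toy table first. This is a sketch only, assuming SQLite via Python's sqlite3 module, where INTEGER PRIMARY KEY AUTOINCREMENT plays the role of the SERIAL id column; the rows are invented.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# AUTOINCREMENT id stands in for the SERIAL PRIMARY KEY column added to mas
conn.execute(
    """CREATE TABLE mas (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           targetname TEXT, targetstatecode TEXT,
           announceddate TEXT, transactionamt REAL)"""
)
conn.executemany(
    "INSERT INTO mas (targetname, targetstatecode, announceddate, transactionamt) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Foo Inc", "CA", "2005-01-10", 100.0),
        ("Foo Inc", "CA", "2005-01-10", 120.0),  # same key as the row above
        ("Bar LLC", "TX", "2007-06-30", 50.0),
    ],
)

# Keep the lowest id for each (targetname, targetstatecode, announceddate) key
conn.execute(
    """CREATE TABLE masinclude AS
       SELECT targetname, targetstatecode, announceddate, MIN(id) AS id
       FROM mas GROUP BY targetname, targetstatecode, announceddate"""
)
mascore = conn.execute(
    "SELECT mas.* FROM mas INNER JOIN masinclude ON mas.id = masinclude.id "
    "ORDER BY mas.id"
).fetchall()
print(mascore)
```

Which duplicate survives is arbitrary (simply the first row loaded), which is acceptable here only because there is no better field to filter on.&lt;br /&gt;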
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is that minimum announceddate itself. If the announceddate falls within 1 week after the minimum announceddate for the targetname, targetstatecode pair (without being equal to it), the dateflag is 0.    &lt;br /&gt;
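The boundary cases of the dateflag rule are easy to get wrong, so here is a small sketch of the same CASE logic in Python. The function name and the sample dates are hypothetical; only the rule itself comes from the query above.&lt;br /&gt;

```python
from datetime import date, timedelta

def dateflag(announceddate, minanndate):
    """Mirror the CASE expression: 1 for the group minimum or for a date
    more than 7 days after it, otherwise 0."""
    if announceddate == minanndate:
        return 1
    if announceddate - timedelta(days=7) > minanndate:
        return 1
    return 0

first = date(2010, 3, 1)  # minimum announceddate for one targetname, targetstatecode pair
flags = [
    dateflag(d, first)
    for d in (
        date(2010, 3, 1),  # the minimum itself
        date(2010, 3, 4),  # 3 days later
        date(2010, 3, 8),  # exactly 7 days later
        date(2010, 3, 9),  # 8 days later
    )
]
print(flags)
```

An announceddate exactly 7 days after the minimum still gets a 0, because the comparison in the CASE expression is strict.&lt;br /&gt;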
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
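The three flags added in Excel could also be computed in code. The sketch below is one possible version in Python. Which condition goes in excludeflag1 and which in excludeflag2 is an assumption based on the order in which they are described, and the sample rows are invented.&lt;br /&gt;

```python
from datetime import date

def matcher_flags(warning, datefirstinv, announceddate):
    """Return (excludeflag1, excludeflag2, warningflag) for one matcher output row."""
    # Assumed mapping: excludeflag1 marks an announcement dated before the first investment
    excludeflag1 = 1 if datefirstinv > announceddate else 0
    # Assumed mapping: excludeflag2 marks an announcement on the first investment date
    excludeflag2 = 1 if datefirstinv == announceddate else 0
    warningflag = 1 if warning == "Hall-Warning:Multiple" else 0
    return excludeflag1, excludeflag2, warningflag

def is_good_match(flags):
    """A good match has all three flags equal to 0."""
    return flags == (0, 0, 0)

# Hypothetical rows, not taken from the real matcher output
clean = matcher_flags("", date(2005, 1, 1), date(2009, 6, 30))
early = matcher_flags("", date(2005, 1, 1), date(2004, 12, 1))
same_day = matcher_flags("", date(2005, 1, 1), date(2005, 1, 1))
multiple = matcher_flags("Hall-Warning:Multiple", date(2005, 1, 1), date(2009, 6, 30))
print(clean, early, same_day, multiple)
```

Only rows for which is_good_match returns True correspond to the matches kept when all three flags are required to be 0.&lt;br /&gt;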
You've imported 9,645 matches into your matcher table in vcdb2, but the query below counts only the &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is not before the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join result should equal the row count of the matcherwarningmindates table, but it doesn't. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
         file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
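The masterdate and validity flags can be checked by hand on a few cases. The sketch below mirrors the two CASE expressions in Python, with None standing in for SQL NULL; the function name and the dates are hypothetical.&lt;br /&gt;

```python
from datetime import date

def exit_flags(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid); None stands in for SQL NULL."""
    present = [d for d in (announceddate, ipoissuedate) if d is not None]
    # masterdate is the earlier of the two dates, or whichever one exists
    masterdate = min(present) if present else None
    # A NULL comparison is never true in SQL, so guard against None == None here
    maskeyvalid = 1 if masterdate is not None and announceddate == masterdate else 0
    ipokeyvalid = 1 if masterdate is not None and ipoissuedate == masterdate else 0
    return masterdate, maskeyvalid, ipokeyvalid

both = exit_flags(date(2012, 5, 1), date(2010, 3, 15))  # ipo came first
mas_only = exit_flags(date(2012, 5, 1), None)
neither = exit_flags(None, None)
print(both, mas_only, neither)
```

If the announceddate and ipoissuedate were ever equal, both flags would be 1 and that portco would appear in both core tables, so that case is worth checking for.&lt;br /&gt;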
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
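The reconciliation above is plain arithmetic, and a sketch of it is given here in Python so it can be rerun with your own counts. The variable names are illustrative only and are not table names in vcdb2; the numbers are the counts quoted in this section.&lt;br /&gt;

```python
# Counts quoted in this section (variable names are illustrative only).
valid_mas_keys = 8610   # rows where maskeyvalid = 1
valid_ipo_keys = 2312   # rows where ipokeyvalid = 1
both_exits = 83         # rows with both an ipoissuedate and an announceddate

# Totals the text compares against.
matched_ipo_keys = 2350
matched_mas_keys = 8655

flagged_total = valid_mas_keys + valid_ipo_keys + both_exits
matched_total = matched_ipo_keys + matched_mas_keys
print(flagged_total, matched_total, flagged_total == matched_total)
```

If the two totals differ, a key was probably lost or duplicated in an earlier step.&lt;br /&gt;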
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table has each company name plus the date and amount of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the key tables keep only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where issuedate &amp;lt; datefirstinv and one that is 1 where issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude the row from your table. If both flags are 0, i.e. issuedate &amp;gt; datefirstinv, then you want to keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched more than once.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
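The minimum-date rule used for the multiple matches can be tried out on toy data. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL so it runs anywhere; the table name, column names and companies are made up for illustration and are not part of vcdb2.&lt;br /&gt;

```python
import sqlite3

# Toy data: one portco matched to two issue dates, one matched once.
# Table and column names are simplified stand-ins, not the vcdb2 schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE matches (coname TEXT, issuedate TEXT)")
con.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [
        ("Acme Inc", "2001-05-01"),
        ("Acme Inc", "1999-03-15"),
        ("Beta LLC", "2004-07-09"),
    ],
)

# Keep one row per company: the match with the earliest issue date.
# ISO-formatted date strings sort in calendar order, so MIN works on them.
rows = con.execute(
    "SELECT coname, MIN(issuedate) FROM matches GROUP BY coname ORDER BY coname"
).fetchall()
print(rows)
```

Each company appears once in the result, which is the property the distinct-key check above verifies for matcherportcoipocore.&lt;br /&gt;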
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the latitude and longitude of each of these rows and delete one row from each duplicated pair. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
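The check used throughout this page, comparing COUNT(*) with the count of distinct keys, can be wrapped in a small helper. This is only a sketch using Python's built-in sqlite3 module with a made-up geo_demo table; the helper name key_is_unique is hypothetical, and against vcdb2 you would run the equivalent queries in psql as shown above.&lt;br /&gt;

```python
import sqlite3

def key_is_unique(con, table, key_columns):
    """Return True when no two rows of `table` share the same key."""
    cols = ", ".join(key_columns)
    total = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = con.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    return total == distinct

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE geo_demo (city TEXT, coname TEXT, startyear INTEGER)")
con.executemany(
    "INSERT INTO geo_demo VALUES (?, ?, ?)",
    [
        ("Houston", "Acme Inc", 2010),
        ("Houston", "Acme Inc", 2010),  # exact duplicate row
        ("Austin", "Beta LLC", 2012),
    ],
)
key = ["city", "coname", "startyear"]
before = key_is_unique(con, "geo_demo", key)

# Remove exact duplicate rows, as geo1 does with SELECT DISTINCT *.
con.execute("CREATE TABLE geo_demo1 AS SELECT DISTINCT * FROM geo_demo")
after = key_is_unique(con, "geo_demo1", key)
print(before, after)
```

The table and column names are pasted straight into the SQL text, so this helper is only suitable for names you typed yourself.&lt;br /&gt;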
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
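A row count that grows after a LEFT JOIN is the usual sign that the key on the right-hand side is not unique. The sketch below reproduces the effect on toy data with Python's built-in sqlite3 module; the companies and lookup tables are invented for illustration.&lt;br /&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companies (coname TEXT)")
con.execute("CREATE TABLE lookup (coname TEXT, matched TEXT)")
con.executemany("INSERT INTO companies VALUES (?)", [("Acme Inc",), ("Beta LLC",)])
# 'Acme Inc' appears twice in the lookup, so its key is not unique.
con.executemany(
    "INSERT INTO lookup VALUES (?, ?)",
    [
        ("Acme Inc", "Acme Incorporated"),
        ("Acme Inc", "Acme Industries"),
        ("Beta LLC", "Beta LLC"),
    ],
)

base_rows = con.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
joined_rows = con.execute(
    "SELECT COUNT(*) FROM companies AS c LEFT JOIN lookup AS l ON l.coname = c.coname"
).fetchone()[0]
print(base_rows, joined_rows)
```

Two companies become three joined rows because one of them matches twice, which is the same mechanism behind the 44,740 to 45,018 jump.&lt;br /&gt;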
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it you can look up the latitude and longitude from Google using the company address. Instructions for the script are at [[Geocode.py]]. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 --city is carried along because deadalive1 and alivecount select it later&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
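The deaddate rule can also be written as a small function, which makes it easy to test edge cases. This is a sketch in Python, not part of the database build; the function name dead_date is hypothetical, and the handling of 29 February is an assumption made for this example.&lt;br /&gt;

```python
from datetime import date
from typing import Optional

def dead_date(
    datelastinv: date,
    issuedate: Optional[date],
    announceddate: Optional[date],
) -> date:
    """Mirror the CASE expression: ipo date, else mas date, else last investment + 5 years."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    # No exit recorded: treat the company as dead five years after its last round.
    try:
        return datelastinv.replace(year=datelastinv.year + 5)
    except ValueError:
        # 29 February has no counterpart five years later; fall back to 28 February.
        return datelastinv.replace(year=datelastinv.year + 5, day=28)

print(dead_date(date(2010, 6, 30), None, None))
print(dead_date(date(2010, 6, 30), date(2012, 1, 15), None))
print(dead_date(date(2010, 6, 30), None, date(2013, 9, 1)))
```

As in the SQL, an ipo date takes precedence over an ma date when both are present.&lt;br /&gt;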
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
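The tempbase and alivecount steps expand each company into one row per year it was alive and then count companies per city and year. A minimal Python sketch of the same idea follows; the three companies are made up, and the tuples are only shaped like deadalive1 rows.&lt;br /&gt;

```python
from collections import Counter

# Hypothetical rows shaped like deadalive1:
# (coname, city, statecode, aliveyear, deadyear).
companies = [
    ("Acme Inc", "Houston", "TX", 2010, 2012),
    ("Beta LLC", "Houston", "TX", 2011, 2011),
    ("Gamma Co", "Austin", "TX", 2012, 2013),
]

alive = Counter()
for coname, city, statecode, aliveyear, deadyear in companies:
    # A company counts as alive in every year from aliveyear through deadyear inclusive.
    for year in range(aliveyear, deadyear + 1):
        alive[(city, statecode, year)] += 1

for key in sorted(alive):
    print(key, alive[key])
```

The inclusive upper bound matches the join condition in tempbase, where the year may equal deadyear.&lt;br /&gt;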
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company and geo details, with the ipo/ma exit information summarized as deaddate, aliveyear and deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
colevelsimple keeps only the coleveloutput rows that have an aliveyear, a deadyear and a latitude.&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001 (one more row than the 143000 in roundcore, so check for duplicated keys)&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that they can be excluded from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
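The flag logic can be summarized as three groups of stage3 labels. The Python sketch below copies the labels from the CASE expressions above; the function name stage_flags is hypothetical and the code is only meant for checking how a given label is classified.&lt;br /&gt;

```python
# Stage labels copied from the CASE expressions above.
GROWTH = {"Seed", "Early Stage", "Later Stage"}
TRANSACTION = {
    "Acq. for Expansion", "Acquisition", "Bridge Loan", "Expansion",
    "Pending Acq", "Recap or Turnaround", "Mezzanine",
}
EXCLUDE = {
    "LBO", "MBO", "Open Market Purchase", "PIPE", "Secondary Buyout",
    "Other", "VC Partnership", "Secondary Purchase",
}

def stage_flags(stage3):
    """Return the six 0/1 flags for one round's stage3 label."""
    return {
        "seedflag": int(stage3 == "Seed"),
        "earlyflag": int(stage3 == "Early Stage"),
        "laterflag": int(stage3 == "Later Stage"),
        "growthflag": int(stage3 in GROWTH),
        "transactionflag": int(stage3 in TRANSACTION),
        "excludeflag": int(stage3 in EXCLUDE),
    }

print(stage_flags("Seed"))
print(stage_flags("PIPE"))
```

A label that is in none of the three groups, or a missing label, gets 0 for every flag, just as the ELSE branches do in the SQL.&lt;br /&gt;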
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico, since those are legitimate. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
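The bounding-box test itself is short. The flags on this page were added to a text file by hand, so the Python sketch below is only an illustration of the rule using the boundaries quoted above; the function name outside_box and the sample points are assumptions made for the example.&lt;br /&gt;

```python
# Boundaries quoted above for the continental US.
LON_MIN, LON_MAX = -125.0, -66.0
LAT_MIN, LAT_MAX = 24.0, 50.0

def outside_box(latitude, longitude):
    """Return True when a point falls outside the continental US box."""
    in_lat = LAT_MAX >= latitude >= LAT_MIN
    in_lon = LON_MAX >= longitude >= LON_MIN
    return not (in_lat and in_lon)

# Approximate coordinates used only as sample points.
print(outside_box(29.76, -95.37))        # Houston
print(outside_box(44.933143, 7.540121))  # a dirty coordinate from the geo cleaning step
print(outside_box(21.31, -157.86))       # Honolulu
```

Points in Hawaii and Puerto Rico fall outside this box even though they are valid, which is why they have to be added back separately.&lt;br /&gt;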
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378 (the 42678 rows in geoimport minus the 2300 rows in geodirtyfix)&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
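Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate: despite its name, firmbaseinclude holds the row with the greater foundingdate for each duplicated firmname, and the anti-join below drops those rows from firmbasecore.&lt;br /&gt;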
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
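The "keep the earlier foundingdate" rule can also be written as a single query. This is a minimal sketch, not the method used on this page: it uses Python's sqlite3 in place of vcdb2, a cut-down firmbase1 with invented rows, and a NOT EXISTS test instead of the MAX/anti-join pair.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of firmbase1; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE firmbase1 (firmname TEXT, foundingdate TEXT, undisclosedflag INTEGER)"
)
conn.executemany(
    "INSERT INTO firmbase1 VALUES (?, ?, ?)",
    [
        ("Alpha Partners", "1999-01-01", 0),
        ("Alpha Partners", "2005-06-30", 0),  # duplicate name, later date
        ("Beta Ventures", "2001-03-15", 0),
        ("Undisclosed Firm", None, 1),
    ],
)

# Keep a disclosed row only if no other disclosed row with the same
# firmname has an earlier foundingdate (ISO dates compare correctly as text).
rows = conn.execute(
    """
    SELECT l.firmname, l.foundingdate
    FROM firmbase1 AS l
    WHERE l.undisclosedflag = 0
      AND NOT EXISTS (
          SELECT 1 FROM firmbase1 AS r
          WHERE r.firmname = l.firmname
            AND r.undisclosedflag = 0
            AND l.foundingdate > r.foundingdate)
    ORDER BY l.firmname
    """
).fetchall()
print(rows)
```

Unlike dropping only the MAX row, this form also works when a firmname appears more than twice. Rows that tie on the earliest foundingdate would all be kept, so the distinct firmname count should still be rechecked afterwards.&lt;br /&gt;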
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that (fundname, firstinvdate) is a valid key, since both counts match. We are nevertheless going to use fundname alone as the key, because it makes later join operations easier.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to find all the duplicated fundnames and, for each one, include only the row that has both the MIN(closedate) and the MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
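 --create fundbasecore table: 27003 unique-name rows + 44 resolved duplicate rows (only 44 rows matched both minimums for the 47 duplicated fundnames)&lt;br /&gt;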
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
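The fundinclude2 join returns 44 rows for 47 duplicated fundnames. The page does not say why. Two plausible causes are that the two minimums sit on different rows, or that a date is NULL (NULL never equals anything in a join). The sketch below, in Python's sqlite3 with invented rows, shows both cases and how a LEFT JOIN can list the fundnames that found no match.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of fundexclude; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fundexclude (fundname TEXT, closedate TEXT, lastinvdate TEXT)")
conn.executemany(
    "INSERT INTO fundexclude VALUES (?, ?, ?)",
    [
        ("Fund A", "2001-01-01", "2003-01-01"),  # this row holds both minimums
        ("Fund A", "2002-01-01", "2004-01-01"),
        ("Fund B", "2001-01-01", "2005-01-01"),  # minimums sit on different rows
        ("Fund B", "2002-01-01", "2004-01-01"),
        ("Fund C", None, "2003-01-01"),          # NULL closedate never matches
        ("Fund C", None, "2004-01-01"),
    ],
)
conn.execute(
    """
    CREATE TABLE fundinclude AS
    SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate
    FROM fundexclude
    GROUP BY fundname
    """
)

# Fundnames in fundinclude for which no single row carries both minimums.
unmatched = [
    name
    for (name,) in conn.execute(
        """
        SELECT fu.fundname
        FROM fundinclude AS fu
        LEFT JOIN fundexclude AS f
          ON f.fundname = fu.fundname
         AND f.closedate = fu.closedate
         AND f.lastinvdate = fu.lastinvdate
        WHERE f.fundname IS NULL
        ORDER BY fu.fundname
        """
    )
]
print(unmatched)
```

Running the equivalent LEFT JOIN in vcdb2 would show which fundnames are currently missing from fundbasecore, so they can be handled by hand.&lt;br /&gt;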
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the firm keys and the fund keys, and include the firmname column from the fundbasecore table alongside the fund keys. Run these two files through the Matcher. Then manually flag the multiple matches (there are only about 50 of them) and reimport the result into vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
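To list the firms that appear in the funds table but not in the firms table, an anti-join is usually enough. The following is a minimal sketch in Python's sqlite3 with invented rows; only the table and column names are taken from this page.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniatures of firmbasecore and fundbasecore; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firmbasecore (firmname TEXT)")
conn.execute("CREATE TABLE fundbasecore (fundname TEXT, firmname TEXT)")
conn.executemany(
    "INSERT INTO firmbasecore VALUES (?)",
    [("Alpha Partners",), ("Beta Ventures",)],
)
conn.executemany(
    "INSERT INTO fundbasecore VALUES (?, ?)",
    [
        ("Alpha Fund I", "Alpha Partners"),
        ("Alpha Fund II", "Alpha Partners"),
        ("Gamma Fund I", "Gamma Capital"),     # firm missing from firmbasecore
        ("Mystery Fund", "Undisclosed Firm"),  # ignored, as in the join above
    ],
)

# LEFT JOIN keeps every fund row; rows with no firm match have f.firmname NULL.
missing_firms = [
    name
    for (name,) in conn.execute(
        """
        SELECT DISTINCT fu.firmname
        FROM fundbasecore AS fu
        LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname
        WHERE f.firmname IS NULL
          AND fu.firmname != 'Undisclosed Firm'
        ORDER BY fu.firmname
        """
    )
]
print(missing_firms)
```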
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Joining funds, firms, roundline with companybasecore==&lt;br /&gt;
 DROP TABLE fundbasefirmbaseroundlinegoodkeys;&lt;br /&gt;
 CREATE TABLE fundbasefirmbaseroundlinegoodkeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, r.coname AS rconame, r.statecode AS rstatecode, r.datefirstinv AS rdatefirstinv, &lt;br /&gt;
 fu.fundname, f.firmname, f.location &lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 INNER JOIN roundline1 AS r ON c.coname = r.coname AND c.statecode = r.statecode AND c.datefirstinv = r.datefirstinv &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON fu.fundname = r.fundname &lt;br /&gt;
 INNER JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
 --298688&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19703</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19703"/>
		<updated>2017-08-02T21:44:18Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning roundline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and the data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count recorded in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains data that we would like to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build the companybase table first. The companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same fixes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise dirty keys will propagate into everything else you build from roundbase. Once the roundbase table is clean we rename it roundbasecore, so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, which gives a 1:1 relationship between keys and entries: every entry yields exactly one key, and every (coname, statecode, datefirstinv) key identifies exactly one entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry.&lt;br /&gt;
Some of the rows in the companybase table have undisclosed company names or belong to companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the database and look at it in a text editor such as TextPad for something that distinguishes the two New York Digital Health LLC entries, so that we can delete one of them manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Delete this record from the roundbase table using the command below. After that we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
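Instead of exporting the whole table to a text editor, the full rows behind each duplicated key can be pulled out by joining the table back to its list of duplicate keys. The sketch below uses Python's sqlite3 with an invented company and invented urls. It is an alternative to the export step, not the procedure that was actually followed here.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of companybase1; the company and urls are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companybase1 (coname TEXT, statecode TEXT, datefirstinv TEXT, url TEXT)"
)
conn.executemany(
    "INSERT INTO companybase1 VALUES (?, ?, ?, ?)",
    [
        ("Example Health LLC", "NY", "2015-08-13", "www.example-one.test"),
        ("Example Health LLC", "NY", "2015-08-13", "www.example-two.test"),
        ("Other Co", "CA", "2014-01-01", "www.other.test"),
    ],
)

# The inner query lists keys that occur more than once; the join brings back
# every column of the offending rows so the differing field is easy to spot.
dup_rows = conn.execute(
    """
    SELECT c.coname, c.url
    FROM companybase1 AS c
    JOIN (SELECT coname, statecode, datefirstinv
          FROM companybase1
          GROUP BY coname, statecode, datefirstinv
          HAVING COUNT(*) > 1) AS d
      ON c.coname = d.coname
     AND c.statecode = d.statecode
     AND c.datefirstinv = d.datefirstinv
    ORDER BY c.url
    """
).fetchall()
print(dup_rows)
```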
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
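The count versus distinct-count check is repeated for every core table, so it can be wrapped in a small helper. This is a sketch under assumptions: key_counts is a hypothetical helper name, it runs against Python's sqlite3 rather than vcdb2, and the rows are invented.&lt;br /&gt;

```python
import sqlite3


def key_counts(conn, table, key_columns):
    """Return (total rows, distinct keys) for a table and its candidate key."""
    cols = ", ".join(key_columns)
    # Identifiers cannot be bound as parameters, so only pass trusted names.
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table}) AS a"
    ).fetchone()[0]
    return total, distinct


# Hypothetical miniature of a table whose key is NOT yet unique.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companybasecore (coname TEXT, statecode TEXT, datefirstinv TEXT, url TEXT)"
)
conn.executemany(
    "INSERT INTO companybasecore VALUES (?, ?, ?, ?)",
    [
        ("Example Co", "TX", "2010-01-01", "www.example.test"),
        ("Example Co", "TX", "2010-01-01", "www.example-alt.test"),
        ("Other Co", "CA", "2012-06-15", None),
    ],
)

total, distinct = key_counts(
    conn, "companybasecore", ["coname", "statecode", "datefirstinv"]
)
print(total, distinct, total == distinct)
```

A table qualifies as a core table only when the two numbers are equal.&lt;br /&gt;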
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicates; this is where you can be creative. I started by finding the duplicates on our key (issuer, issuedate, statecode). Have a look at them in TextPad or Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which add up to 1888 rows in the ipos table, we look for a way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicated keys differ in principalamt. For each key, let's keep the row with the MAX principalamt and throw out the rows with a lower principalamt. The query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore row count does not match the count of DISTINCT keys. This means there are still some duplicates, namely rows that tie on the maximum principalamt for the same key. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
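The whole ipos clean-up follows one pattern: pick a winner per key with MAX, join back, then look for keys that still have more than one row because of ties. A compact sketch of that pattern, using Python's sqlite3 and invented issuers in place of the real ipos data:&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of the ipos table; issuers and amounts are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipos (issuer TEXT, issuedate TEXT, statecode TEXT, principalamt REAL)"
)
conn.executemany(
    "INSERT INTO ipos VALUES (?, ?, ?, ?)",
    [
        ("Solo Inc", "1999-05-01", "CA", 50.0),
        ("Twice Inc", "2000-03-10", "NY", 10.0),   # loses to the row below
        ("Twice Inc", "2000-03-10", "NY", 25.0),
        ("Tied Corp", "2001-07-20", "TX", 40.0),   # two rows tie on the maximum
        ("Tied Corp", "2001-07-20", "TX", 40.0),
    ],
)
conn.execute(
    """
    CREATE TABLE ipoinclude AS
    SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt
    FROM ipos
    GROUP BY issuer, issuedate, statecode
    """
)

join_sql = """
    FROM ipos AS i
    JOIN ipoinclude AS inc
      ON i.issuer = inc.issuer
     AND i.issuedate = inc.issuedate
     AND i.statecode = inc.statecode
     AND i.principalamt = inc.principalamt
"""
joined_rows = conn.execute("SELECT COUNT(*) " + join_sql).fetchone()[0]

# Keys that still have more than one row after the MAX filter are ties.
still_dup = conn.execute(
    "SELECT i.issuer, COUNT(*) " + join_sql +
    "GROUP BY i.issuer, i.issuedate, i.statecode HAVING COUNT(*) > 1"
).fetchall()
print(joined_rows, still_dup)
```

In this toy data the join returns 4 rows for 3 keys, and the tie query names the key that has to be resolved by hand.&lt;br /&gt;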
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
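 FROM mas --use the ids from mas itself, because mascore joins on mas.id and the ids in mas1 are assigned separately&lt;br /&gt;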
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
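The MIN(id) technique can be sketched end to end as follows. This uses Python's sqlite3 with an INTEGER PRIMARY KEY in place of PostgreSQL's SERIAL, invented rows, and a hypothetical acquirer column. The important detail is that the ids used to pick the survivors come from the same table that is joined afterwards.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of mas; rows and the acquirer column are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mas (
        id INTEGER PRIMARY KEY,   -- auto-assigned, plays the role of SERIAL
        targetname TEXT,
        targetstatecode TEXT,
        announceddate TEXT,
        acquirer TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO mas (targetname, targetstatecode, announceddate, acquirer) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Target A", "CA", "2010-01-05", "Buyer One"),
        ("Target A", "CA", "2010-01-05", "Buyer Two"),  # same key, second row
        ("Target B", "NY", "2011-02-01", "Buyer Three"),
    ],
)

# One id per key (the smallest), then join back to fetch the full rows.
mascore_ids = [
    i
    for (i,) in conn.execute(
        """
        SELECT m.id
        FROM mas AS m
        JOIN (SELECT MIN(id) AS id
              FROM mas
              GROUP BY targetname, targetstatecode, announceddate) AS keep
          ON m.id = keep.id
        ORDER BY m.id
        """
    )
]
print(mascore_ids)
```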
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 7 days later than the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum by 7 days or less, the flag is 0.    &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
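The date-window flag can be checked on a few hand-made dates. The sketch below uses Python's sqlite3, where julianday() differences stand in for PostgreSQL's date minus INTERVAL arithmetic. The target and its dates are invented.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of maskeys; the target and dates are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE maskeys (targetname TEXT, targetstatecode TEXT, announceddate TEXT)"
)
conn.executemany(
    "INSERT INTO maskeys VALUES (?, ?, ?)",
    [
        ("Target A", "CA", "2010-01-05"),  # the minimum date itself
        ("Target A", "CA", "2010-01-08"),  # 3 days after the minimum
        ("Target A", "CA", "2010-01-12"),  # exactly 7 days after the minimum
        ("Target A", "CA", "2010-03-01"),  # well outside the window
    ],
)

# dateflag = 1 for the minimum date and for dates more than 7 days after it.
flags = conn.execute(
    """
    SELECT k.announceddate,
           CASE WHEN julianday(k.announceddate) - julianday(m.minanndate) > 7
                  OR k.announceddate = m.minanndate
                THEN 1 ELSE 0 END AS dateflag
    FROM maskeys AS k
    JOIN (SELECT targetname, targetstatecode, MIN(announceddate) AS minanndate
          FROM maskeys
          GROUP BY targetname, targetstatecode) AS m
      ON k.targetname = m.targetname
     AND k.targetstatecode = m.targetstatecode
    ORDER BY k.announceddate
    """
).fetchall()
print(flags)
```

Note that the comparison is strict, so a date exactly 7 days after the minimum is still filtered out.&lt;br /&gt;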
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is later than the datefirstinv (neither earlier than nor equal to it).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
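The three flags were added in Excel, but the same rules can be expressed as CASE expressions if you prefer to import the raw matcher output first. The sketch below is an assumption-laden alternative, not what was done here: matcherraw is a hypothetical table holding the unflagged output, the rows are invented, and it runs in Python's sqlite3.&lt;br /&gt;

```python
import sqlite3

# Hypothetical raw matcher output (no flag columns yet); rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE matcherraw (
        warning TEXT,
        file1coname TEXT,
        file1datefirstinv TEXT,
        file2announceddate TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO matcherraw VALUES (?, ?, ?, ?)",
    [
        (None, "A Good Match", "2005-01-01", "2009-06-01"),
        (None, "B Acquired First", "2005-01-01", "2003-06-01"),
        (None, "C Same Day", "2005-01-01", "2005-01-01"),
        ("Hall-Warning:Multiple", "D Multiple", "2005-01-01", "2008-02-02"),
    ],
)

flagged = conn.execute(
    """
    SELECT file1coname,
           -- announced before the first investment
           CASE WHEN file1datefirstinv > file2announceddate THEN 1 ELSE 0 END AS excludeflag1,
           -- announced on the day of the first investment
           CASE WHEN file1datefirstinv = file2announceddate THEN 1 ELSE 0 END AS excludeflag2,
           -- the matcher reported more than one candidate
           CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END AS warningflag
    FROM matcherraw
    ORDER BY file1coname
    """
).fetchall()
print(flagged)
```

Only rows with all three flags equal to 0 count as good matches.&lt;br /&gt;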
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return as many rows as the matcherwarningmindates table (364), but it returns 366. Some portco keys match more than one merger on the same minimum announceddate. To find these entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0, which creates the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipokey and a maskey will cause our join row counts to increase every time we join on these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
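One case worth checking, which the row counts alone do not reveal: if a company's announceddate and ipoissuedate fall on the same day, both flags are set to 1 and that portcokey would end up with both an ipo and a mas exit. A quick sketch to look for such ties:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;
If this returns anything other than 0, decide which exit to keep for those rows before relying on the flags.&lt;br /&gt;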
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total with the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
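The arithmetic can also be done in the database. A sketch, using only tables built above, that puts the two totals side by side (the aliases keptplusdropped and matchertotal are illustrative):&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
     WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusdropped,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;
The two columns should be equal.&lt;br /&gt;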
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the corresponding dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
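A further check, sketched here with the same key columns, is that no portco key was duplicated by the joins. The query should return no rows:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;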
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one for rows where issuedate &amp;lt; datefirstinv and one for rows where issuedate = datefirstinv. If either exclude flag is 1, then you want to exclude the row from your table; the rows worth keeping are those where issuedate &amp;gt; datefirstinv. If both flags are 0, you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
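The flags can also be computed in SQL instead of Excel. The sketch below assumes the raw Matcher output (the first seven columns above, without the flag columns) has been loaded into a table called matcherportcoiporaw. That table name is hypothetical and not part of this build, and which condition belongs in excludeflag1 versus excludeflag2 is also an assumption:&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
  CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END AS excludeflag1,&lt;br /&gt;
  CASE WHEN file2issuedate = file1datefirstinv THEN 1 ELSE 0 END AS excludeflag2,&lt;br /&gt;
  CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;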
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
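A more compact alternative, offered as a sketch rather than the method used here, is PostgreSQL's DISTINCT ON, which keeps one row per portcokey with the earliest issuedate. Unlike the join above, it also returns a single row when two matches share the same minimum date:&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;&lt;br /&gt;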
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows against 43651 keys), spread over the 8 keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
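To see the full rows behind those three keys before deleting anything, a query like the following can help. It is a sketch that selects every column of geo1 so the coordinates and endyear can be compared side by side:&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;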
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
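The extra 278 rows (45,018 - 44,740) come from portco keys that appear more than once in the matcher output, since the geocore keys are already unique. A sketch of a query that lists them:&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;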
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
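Because this is a LEFT JOIN, companies without an entry in geoimport keep NULL coordinates. A quick sketch to count them:&lt;br /&gt;
 SELECT COUNT(*) FROM companybasegeomaster WHERE latitude IS NULL;&lt;br /&gt;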
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are on the [[Geocode.py]] page. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
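To see how the CASE behaves, here is a small self-contained sketch on three made-up rows (the dates are invented for illustration): one company with no exit, one with an ipo, and one with an acquisition.&lt;br /&gt;
 SELECT t.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
  WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-03-15', NULL::date, NULL::date),&lt;br /&gt;
  (DATE '2010-03-15', DATE '2012-06-01', NULL::date),&lt;br /&gt;
  (DATE '2010-03-15', NULL::date, DATE '2011-09-30')&lt;br /&gt;
 ) AS t(datelastinv, issuedate, announceddate);&lt;br /&gt;
The first row gets a deaddate five years after datelastinv, the second its issuedate and the third its announceddate. Note that adding an INTERVAL to a date yields a timestamp in PostgreSQL, so the deaddate column comes back as a timestamp rather than a date.&lt;br /&gt;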
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
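The tempbase query relies on an allyears table with one row per calendar year, which is not built in this section. If it is missing, a minimal sketch such as the following would create it. The single integer column year matches how the table is used above, but the 1980 to 2017 range is an assumption and should be adjusted to the span of the data:&lt;br /&gt;
 CREATE TABLE IF NOT EXISTS allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;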
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
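The filter drops 32575 - 31171 = 1404 companies. A sketch of a query showing how many rows fail each condition (the aliases are illustrative, and a row can fail more than one condition):&lt;br /&gt;
 SELECT&lt;br /&gt;
  SUM(CASE WHEN aliveyear IS NULL THEN 1 ELSE 0 END) AS noaliveyear,&lt;br /&gt;
  SUM(CASE WHEN deadyear IS NULL THEN 1 ELSE 0 END) AS nodeadyear,&lt;br /&gt;
  SUM(CASE WHEN latitude IS NULL THEN 1 ELSE 0 END) AS nolatitude&lt;br /&gt;
 FROM coleveloutput;&lt;br /&gt;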
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
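One way to sanity-check the aggregation, sketched under the assumption that roundplus has not changed since roundleveloutput was built, is to compare the total of numsel with the number of qualifying rounds. The two numbers should match:&lt;br /&gt;
 SELECT SUM(numsel) FROM roundleveloutput;&lt;br /&gt;
 SELECT COUNT(*) FROM roundplus WHERE hadgrowthvc = 1 AND growthflag = 1;&lt;br /&gt;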
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
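To confirm the exclusion worked, the number of distinct keys in roundcore should equal its row count (a sketch following the same pattern used for the other core tables):&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;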
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks financing for activities we are not interested in, which should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are what the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
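The growthflag, transactionflag and excludeflag appear intended to cover every stage3 value between them. A minimal sketch, assuming stageflags has been built as above, that lists any stage3 value (including NULL) that is not covered by exactly one of the three flags:&lt;br /&gt;
 --stage3 values that fall into none, or more than one, of the three groups&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag + transactionflag + excludeflag != 1&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;
An empty result means every round is classified exactly once.&lt;br /&gt;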
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
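A minimal sketch of the same box test done inside the database rather than in an external file. It assumes geoimport has latitude and longitude columns, as the geodirtyfix layout suggests, and the outsideboxflag name is illustrative. Hawaii and Puerto Rico fall outside this box and still need to be added back separately:&lt;br /&gt;
 --flag is 1 when the point lies outside the continental US box&lt;br /&gt;
 SELECT coname, latitude, longitude,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN latitude BETWEEN 24 AND 50 AND longitude BETWEEN -125 AND -66 THEN 0::int&lt;br /&gt;
  ELSE 1::int&lt;br /&gt;
 END AS outsideboxflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Rows with missing coordinates also get flag 1 here, because the BETWEEN test is not true for NULL.&lt;br /&gt;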
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
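Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate. Note that, despite its name, firmbaseinclude holds the greater foundingdate for each duplicated firmname, and the final anti-join drops those rows from firmbasecore.&lt;br /&gt;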
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The counts match, so (fundname, firstinvdate) is a valid key. However, we are going to use only fundname as the key because it makes join operations easier later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
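The join on MIN(closedate) and MIN(lastinvdate) returns 44 rows rather than 47, and fundbasecore ends up with 27047 rows against 27050 distinct fundnames, so at least three duplicated names drop out entirely. This can happen when a date is NULL or when the two minimums sit on different rows. A hedged alternative sketch, not the method used above, that keeps exactly one row per duplicated fundname using PostgreSQL's DISTINCT ON (the table name fundinclude2alt is illustrative):&lt;br /&gt;
 --one row per fundname: earliest closedate, ties broken by earliest lastinvdate&lt;br /&gt;
 CREATE TABLE fundinclude2alt AS&lt;br /&gt;
 SELECT DISTINCT ON (fundname) *&lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 ORDER BY fundname, closedate NULLS LAST, lastinvdate NULLS LAST;&lt;br /&gt;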
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
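A minimal sketch of one way to list those firms with an anti-join, assuming the firmname columns used in the query above:&lt;br /&gt;
 --firm names that appear in fundbasecore but have no row in firmbasecore&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm'&lt;br /&gt;
 ORDER BY fu.firmname;&lt;br /&gt;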
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;br /&gt;
Add a flag for undisclosed funds.&lt;br /&gt;
 CREATE TABLE roundline1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname = 'Undisclosed Fund' THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS undisclosedflag&lt;br /&gt;
 FROM roundline;&lt;br /&gt;
 --385753&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19680</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19680"/>
		<updated>2017-08-02T17:09:00Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining firms with funds */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the line count given in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. You know your companybase table is clean once it has a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off the roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
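The 1888 figure follows from the counts above: 9491 distinct keys minus 939 duplicated keys leaves 8552 keys that occur once, and 10440 - 8552 = 1888 rows belong to duplicated keys. A minimal sketch of the same check in SQL, relying on the count column that SELECT *, COUNT(*) creates in ipoduplicates:&lt;br /&gt;
 --total rows in ipos that share a key with at least one other row&lt;br /&gt;
 SELECT SUM(count) AS duplicaterows&lt;br /&gt;
 FROM ipoduplicates;&lt;br /&gt;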
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column and took the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
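 --join on mas1: the masinclude ids were generated on mas1, and the separately generated ids on mas are not guaranteed to line up&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;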
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or when it is itself the minimum announceddate, for that targetname, targetstatecode pair. If the announceddate is within 1 week after the minimum announceddate for the targetname, targetstatecode pair, then it is 0.&lt;br /&gt;
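A small self-contained illustration of the flag logic, using made-up dates and a minimum announceddate of 2010-01-01 (the values are illustrative, not taken from the mas table):&lt;br /&gt;
 --2010-01-01 is the minimum (flag 1); 01-05 and 01-08 are within 7 days (flag 0); 01-09 is more than 7 days later (flag 1)&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS v(d);&lt;br /&gt;
Note that a date exactly 7 days after the minimum gets flag 0, because the comparison is strict.&lt;br /&gt;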
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
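The Excel flagging step for the portco-to-mas matches could also be scripted. The sketch below is only an illustration under assumptions: the function name flag_match and the sample dates are made up, and it presumes the dates have already been parsed from the output.matched file.&lt;br /&gt;

```python
from datetime import date

MULTIPLE_WARNING = "Hall-Warning:Multiple"


def flag_match(warning, datefirstinv, announceddate):
    """Return (excludeflag1, excludeflag2, warningflag) for one matched row.

    Hypothetical helper: it mirrors the three flags added by hand in Excel.
    """
    # excludeflag1: the deal was announced before the first investment
    excludeflag1 = 1 if datefirstinv > announceddate else 0
    # excludeflag2: the deal was announced on the day of the first investment
    excludeflag2 = 1 if datefirstinv == announceddate else 0
    # warningflag: the Matcher reported more than one candidate match
    warningflag = 1 if warning.strip() == MULTIPLE_WARNING else 0
    return excludeflag1, excludeflag2, warningflag


# Made-up example rows, not taken from the real matcher output
print(flag_match("", date(2001, 1, 15), date(2005, 3, 1)))
print(flag_match("Hall-Warning:Multiple", date(2001, 1, 15), date(2001, 1, 15)))
```

A row is a &amp;quot;good&amp;quot; match only when all three returned flags are 0.&lt;br /&gt;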
You've imported 9,645 matches into your matcher table in vcdb2, but only some of them are &amp;quot;good&amp;quot; matches; the query below counts them. Good matches are those that do not contain warnings, where the merger/acquisition announceddate &amp;gt; datefirstinv and where the datefirstinv does not equal the announceddate, i.e. rows where both exclude flags and the warning flag are 0.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
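The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. This can happen when a portco key has two matches that share its minimum announceddate. To find the dirty entries we'll use the query below.&lt;br /&gt;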
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a SQL query to take whichever of the ipokey or maskey has the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
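The masterdate and validity-flag logic in the two queries above can be checked on toy rows with a short sketch. The function name exit_flags is hypothetical, and None stands in for a SQL NULL.&lt;br /&gt;

```python
from datetime import date


def exit_flags(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid) for one portco key.

    Hypothetical helper; None plays the role of SQL NULL.
    """
    present = [d for d in (announceddate, ipoissuedate) if d is not None]
    # masterdate is the earliest exit date that exists, or None if no exit
    masterdate = min(present) if present else None
    # a comparison against NULL is never true in SQL, so the flag stays 0
    maskeyvalid = 1 if announceddate is not None and announceddate == masterdate else 0
    ipokeyvalid = 1 if ipoissuedate is not None and ipoissuedate == masterdate else 0
    return masterdate, maskeyvalid, ipokeyvalid


# Made-up dates: the IPO comes first, so only the ipokey stays valid
print(exit_flags(date(2006, 5, 1), date(2004, 2, 10)))
```

Note that if both dates are equal, both flags come out as 1, so such a portco would appear in both core tables.&lt;br /&gt;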
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total with the combined row counts of your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
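The reconciliation works because each of the 83 portcos with both keys appears once in each matcher core table but keeps only one valid exit. A small sketch of the arithmetic, using the counts reported on this page (the function name counts_reconcile is made up):&lt;br /&gt;

```python
def counts_reconcile(mas_valid, ipo_valid, both_keys, mas_matches, ipo_matches):
    """True when valid keys plus dropped keys add up to all matched keys."""
    return mas_valid + ipo_valid + both_keys == mas_matches + ipo_matches


# Counts taken from the queries on this page
print(counts_reconcile(
    mas_valid=8610,    # rows in companybasekeymaskeycore
    ipo_valid=2312,    # rows in companybasekeyipokeycore
    both_keys=83,      # portcos matched to both an ipokey and a maskey
    mas_matches=8655,  # rows in matcherportcomascore
    ipo_matches=2350,  # rows in matcherportcoipocore
))  # prints True (both sides equal 11005)
```

The identity assumes no portco has an ipoissuedate equal to its announceddate; such a tie would keep both keys valid and push the left-hand side above the right.&lt;br /&gt;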
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the dates and amounts if the company had an ipo or a merger/acquisition. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add one exclude flag that is 1 where the issuedate &amp;lt; datefirstinv and a second exclude flag that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
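The &amp;quot;keep the earliest match per portco key&amp;quot; idea used in this section can be sketched outside the database as well. This is only an illustration: the function name earliest_match_per_key and the sample rows are made up.&lt;br /&gt;

```python
from datetime import date


def earliest_match_per_key(rows):
    """Keep, for each portco key, the matched row with the earliest issuedate.

    rows: iterable of (portco_key, issuedate, issuer) tuples (hypothetical layout).
    Returns a dict mapping portco_key to (issuedate, issuer).
    """
    best = {}
    for key, issuedate, issuer in rows:
        # replace the stored row only when this one is strictly earlier
        if key not in best or best[key][0] > issuedate:
            best[key] = (issuedate, issuer)
    return best


# Made-up rows: one portco key matched to two issuers
rows = [
    (("Acme Inc", "TX", date(1999, 3, 1)), date(2004, 6, 1), "Acme Holdings"),
    (("Acme Inc", "TX", date(1999, 3, 1)), date(2002, 1, 15), "Acme Inc"),
]
print(earliest_match_per_key(rows))
```

Unlike the SQL join on MIN(file2issuedate), this sketch always returns exactly one row per key, because a tie on the earliest date keeps the first row seen. The SQL version can return two rows for such a tie, which is why the distinct-key check above is still worth running.&lt;br /&gt;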
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 vs 43651), spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
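As a sanity check, the count column above should account for the gap between the rows in geo1 (43662) and the distinct keys (43651). A minimal sketch of that arithmetic (the function name surplus_rows is made up):&lt;br /&gt;

```python
def surplus_rows(counts):
    """Rows beyond the first for each duplicated key."""
    return sum(c - 1 for c in counts)


# The count column from the query output above, top to bottom
dup_counts = [2, 2, 2, 2, 4, 2, 2, 3]

print(surplus_rows(dup_counts))  # prints 11
print(43662 - 43651)             # prints 11
```

Both numbers agree, so the 8 keys listed are the only duplicated ones in geo1.&lt;br /&gt;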
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
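The jump in row count is the usual left-join fan-out: any key on the left that matches more than one row on the right is repeated once per match. A toy sketch of the effect, with made-up keys and a hypothetical helper name:&lt;br /&gt;

```python
def left_join_count(left_keys, right_keys):
    """Number of rows a LEFT JOIN on key would return."""
    matches = {}
    for key in right_keys:
        matches[key] = matches.get(key, 0) + 1
    # an unmatched left key still yields one NULL-padded row
    return sum(max(matches.get(key, 0), 1) for key in left_keys)


left = ["a", "b", "c"]                         # unique keys, like companybasecore
print(left_join_count(left, ["a", "b"]))       # prints 3
print(left_join_count(left, ["a", "a", "b"]))  # prints 4
```

The output row count equals the left row count only when every left key matches at most one right row, which is the 1:1 property the matcher output lacks here.&lt;br /&gt;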
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are on the [[Geocode.py]] page. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect. This was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
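The deaddate rule in the CASE expression above can be written as a small function for spot-checking individual companies. This is a sketch under assumptions: the names add_years and dead_date are made up, None stands in for NULL, and the Feb 29 handling is meant to match how PostgreSQL clamps date + interval '5 YEAR' to the end of February.&lt;br /&gt;

```python
from datetime import date


def add_years(d, years):
    """Add whole years to a date, clamping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year
        return d.replace(year=d.year + years, day=28)


def dead_date(datelastinv, issuedate, announceddate):
    """Exit date if there is one, otherwise datelastinv + 5 years."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    if datelastinv is None:
        return None  # mirrors NULL + interval giving NULL
    return add_years(datelastinv, 5)


# Made-up example: no exit, last investment in 2010
print(dead_date(date(2010, 7, 4), None, None))  # prints 2015-07-04
```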
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
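The tempbase and alivecount steps expand each company into one row per year it is alive and then count companies per city and year. A toy sketch of the same idea follows; the function name alive_counts and the sample companies are made up, and the years argument stands in for the allyears table.&lt;br /&gt;

```python
from collections import Counter


def alive_counts(companies, years):
    """Count companies alive per (city, statecode, year).

    companies: iterable of (coname, city, statecode, aliveyear, deadyear).
    """
    alive = set()
    for coname, city, statecode, aliveyear, deadyear in companies:
        for year in years:
            # a company counts as alive from aliveyear through deadyear inclusive
            if year >= aliveyear and deadyear >= year:
                # the set mirrors SELECT DISTINCT year, coname, city, statecode
                alive.add((year, coname, city, statecode))
    return Counter((city, statecode, year) for year, coname, city, statecode in alive)


# Made-up companies
companies = [
    ("A Corp", "Houston", "TX", 2000, 2002),
    ("B Corp", "Houston", "TX", 2001, 2001),
]
counts = alive_counts(companies, range(1999, 2004))
print(counts[("Houston", "TX", 2001)])  # prints 2
```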
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks financing for activities we are not interested in, so that it can be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are what the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
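As a sanity check on the flags above, the sketch below lists any stage3 values that fall into none of the growth, transaction or exclude groups, or into more than one of them. It assumes stageflags has been built exactly as above; an empty result would mean every round is covered by exactly one group.&lt;br /&gt;
 --sketch: stage3 values not covered by exactly one flag group&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag + transactionflag + excludeflag != 1&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;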
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist, so the coordinates can be looked up again. Isolate the dirty geo-coordinates and do a lookup using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
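The bounding-box test described at the top of this section could also be sketched in SQL. This is only an illustration: the flag files above were prepared outside the db, the flag column names follow geodirtydatawithflags, and using statecode values 'HI' and 'PR' to add back Hawaii and Puerto Rico is an assumption.&lt;br /&gt;
 --sketch: flag coordinates outside the continental US box&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude NOT BETWEEN -125 AND -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude NOT BETWEEN 24 AND 50 THEN 1::int ELSE 0::int END AS latdirtyflag,&lt;br /&gt;
 CASE WHEN statecode = 'HI' THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 CASE WHEN statecode = 'PR' THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Rows with a NULL latitude or longitude get a 0 in the corresponding dirty flag here, so they would need to be handled separately.&lt;br /&gt;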
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
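Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the lesser foundingdate: firmbaseinclude collects the MAX(foundingdate) row for each duplicated firmname, and those rows are then left out of firmbasecore.&lt;br /&gt;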
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
&lt;br /&gt;
==Joining funds with roundline==&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19679</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19679"/>
		<updated>2017-08-02T16:50:22Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Joining firms with funds */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the row counts copied into your new tables match the row counts noted in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table, because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 rows in the companybase table. So 1 key can match to more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
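The two-count key check can also be done in one query. A minimal sketch, assuming PostgreSQL's row-value syntax inside COUNT(DISTINCT ...); the key is unique when both columns are equal.&lt;br /&gt;
 --sketch: row count and distinct key count side by side&lt;br /&gt;
 SELECT COUNT(*) AS numrows,&lt;br /&gt;
 COUNT(DISTINCT (coname, statecode, datefirstinv)) AS numkeys&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;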
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
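An alternative to the MAX(principalamt) join plus manual DELETEs is a window function. This is a sketch of a different technique, not the method used above: it keeps exactly one row per key (the largest principalamt, with ties broken arbitrarily) instead of dropping tied keys, so its row count may differ from the 9470 above. The table name ipocorealt and the helper column rn are made up for the example.&lt;br /&gt;
 --sketch: one row per (issuer, issuedate, statecode) key&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT ipos.*, ROW_NUMBER() OVER (PARTITION BY issuer, issuedate, statecode ORDER BY principalamt DESC NULLS LAST) AS rn&lt;br /&gt;
 FROM ipos) AS ranked&lt;br /&gt;
 WHERE rn = 1;&lt;br /&gt;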
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column to mas and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 --Copy mas only after adding the id so that mas1 and mas share the same id values&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count should match the total row count of the table, which means the mascore table is clean.&lt;br /&gt;
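As a rough illustration of the MIN(id) de-duplication above, here is a minimal Python sketch that uses the standard-library sqlite3 module with an in-memory database instead of vcdb2. The function name dedupe_by_min_id, the payload column and the sample rows are hypothetical and only there for illustration.&lt;br /&gt;

```python
import sqlite3


def dedupe_by_min_id(rows):
    """Keep one row per (targetname, targetstatecode, announceddate) key.

    rows: list of (targetname, targetstatecode, announceddate, payload) tuples.
    Returns the surviving rows, including their id, ordered by id.
    """
    con = sqlite3.connect(':memory:')
    # INTEGER PRIMARY KEY plays the role of the SERIAL id column.
    con.execute(
        'CREATE TABLE mas (id INTEGER PRIMARY KEY, targetname TEXT, '
        'targetstatecode TEXT, announceddate TEXT, payload TEXT)'
    )
    con.executemany(
        'INSERT INTO mas (targetname, targetstatecode, announceddate, payload) '
        'VALUES (?, ?, ?, ?)',
        rows,
    )
    # For every key keep only the row with the smallest id.
    kept = con.execute(
        'SELECT id, targetname, targetstatecode, announceddate, payload '
        'FROM mas WHERE id IN ('
        '  SELECT MIN(id) FROM mas '
        '  GROUP BY targetname, targetstatecode, announceddate) '
        'ORDER BY id'
    ).fetchall()
    con.close()
    return kept


# Hypothetical sample data: the first two rows share a key.
sample = [
    ('Acme Corp', 'TX', '2001-03-05', 'first copy'),
    ('Acme Corp', 'TX', '2001-03-05', 'second copy'),
    ('Beta LLC', 'CA', '1999-11-30', 'only copy'),
]
core = dedupe_by_min_id(sample)
```

The id is assigned once, before grouping, so the same id value identifies the row to keep in both the grouping query and the final filter.&lt;br /&gt;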
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week after the earliest announceddate for the same targetname, targetstatecode pair. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or when it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum announceddate for the targetname, targetstatecode pair by 1 week or less, then the dateflag is 0.&lt;br /&gt;
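The dateflag rule can be sketched in a few lines of Python. This is only an illustration of the CASE expression above, not part of the build; the function name date_flag, the window_days parameter and the sample dates are made up for the example.&lt;br /&gt;

```python
from datetime import date, timedelta


def date_flag(announced, min_announced, window_days=7):
    """Return 1 to keep the key, 0 to drop it as a near-duplicate."""
    # The earliest announcement for the pair is always kept.
    if announced == min_announced:
        return 1
    # Mirrors: announceddate - INTERVAL '7 day' > minimum announceddate
    if announced - timedelta(days=window_days) > min_announced:
        return 1
    return 0


first = date(2005, 6, 1)  # hypothetical minimum announceddate
flags = [
    date_flag(first, first),             # the minimum itself
    date_flag(date(2005, 6, 4), first),  # 3 days later
    date_flag(date(2005, 6, 8), first),  # exactly 7 days later
    date_flag(date(2005, 6, 9), first),  # 8 days later
]
```

Note that a date exactly 7 days after the minimum is still flagged 0, because the comparison is strict.&lt;br /&gt;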
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where the announceddate &amp;lt; datefirstinv and another exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
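If you would rather not add the flags by hand in Excel, the same three flags could be computed by a small script before the import. The sketch below is one possible way to do it in Python; the function name matcher_flags and the sample dates are hypothetical, and it assumes the dates have already been parsed.&lt;br /&gt;

```python
from datetime import date


def matcher_flags(warning, datefirstinv, exitdate):
    """Return (excludeflag1, excludeflag2, warningflag) for one matched row."""
    # excludeflag1: the exit is dated before the first investment.
    excludeflag1 = 1 if datefirstinv > exitdate else 0
    # excludeflag2: the exit is dated on the day of the first investment.
    excludeflag2 = 1 if exitdate == datefirstinv else 0
    # warningflag: the Matcher reported more than one candidate match.
    warningflag = 1 if warning.strip() == 'Hall-Warning:Multiple' else 0
    return excludeflag1, excludeflag2, warningflag


# Hypothetical rows: a good match, an exit before investment, a same-day exit.
good = matcher_flags('', date(2000, 1, 15), date(2004, 7, 1))
early = matcher_flags('', date(2000, 1, 15), date(1999, 12, 1))
same_day = matcher_flags('Hall-Warning:Multiple', date(2000, 1, 15), date(2000, 1, 15))
```

A row is a &amp;quot;good&amp;quot; match only when all three flags are 0.&lt;br /&gt;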
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns only the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate &amp;gt; datefirstinv for a merger/acquisition and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The number of rows from the inner join should equal the number of rows in the matcherwarningmindates table, but it doesn't. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. Each portco key from the companybase table should match exactly 1 mas key from the mascore table. If any portco key matches more than 1 mas key you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open the file in Excel and add a flag to find the rows that have an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
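The masterdate and validity flags built by the two queries above can be summarised in a short Python sketch. It is only meant to illustrate the logic; the function name exit_flags and the sample dates are hypothetical.&lt;br /&gt;

```python
from datetime import date


def exit_flags(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid) for one portco key.

    Either date may be None when the portco has no match of that kind.
    """
    candidates = [d for d in (announceddate, ipoissuedate) if d is not None]
    # The earliest available exit date, like LEAST() in the query above.
    masterdate = min(candidates) if candidates else None
    maskeyvalid = 1 if announceddate is not None and announceddate == masterdate else 0
    ipokeyvalid = 1 if ipoissuedate is not None and ipoissuedate == masterdate else 0
    return masterdate, maskeyvalid, ipokeyvalid


# Hypothetical portcos: matched to both, matched to mas only, matched to neither.
both = exit_flags(date(2006, 5, 1), date(2003, 2, 14))
mas_only = exit_flags(date(2006, 5, 1), None)
neither = exit_flags(None, None)
```

One thing the sketch makes visible: if the announceddate and the ipoissuedate happen to be the same day, both flags come out as 1, and the same holds for the SQL version.&lt;br /&gt;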
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
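The reasoning behind this check is that each of the 83 portcos matched to both an ipokey and a maskey keeps only one of the two, so the kept keys plus the 83 dropped keys should add up to the rows in the two matcher core tables. This assumes none of the 83 has identical ipo and mas dates, in which case both keys would be kept. The arithmetic, using the counts reported on this page (the variable names are only for illustration):&lt;br /&gt;

```python
# Counts taken from the query results on this page.
maskeys_kept = 8610       # rows in companybasekeymaskeycore
ipokeys_kept = 2312       # rows in companybasekeyipokeycore
both_matched = 83         # portcos matched to both an ipokey and a maskey
matcher_mas_rows = 8655   # rows in matcherportcomascore
matcher_ipo_rows = 2350   # rows in matcherportcoipocore

kept_plus_dropped = maskeys_kept + ipokeys_kept + both_matched
matched_total = matcher_mas_rows + matcher_ipo_rows
print(kept_plus_dropped, matched_total)
```

Both sides come to 11005.&lt;br /&gt;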
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude this entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows against 43651 keys), so some keys are still duplicated. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
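The duplicate-key check used throughout this section (group by the key and keep the groups with more than one row) can also be run on an exported key file outside the database. Below is a minimal Python sketch of that check; the function name duplicate_keys and the sample rows are hypothetical.&lt;br /&gt;

```python
from collections import Counter


def duplicate_keys(rows):
    """Return {key: count} for every (city, coname, startyear) seen more than once."""
    counts = Counter((city, coname, startyear) for city, coname, startyear in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample: the first two rows share a key.
rows = [
    ('New York', 'Example Health LLC', 2015),
    ('New York', 'Example Health LLC', 2015),
    ('Houston', 'Sample Inc', 1999),
]
dups = duplicate_keys(rows)
```

An empty result means every key is distinct, which is the condition the geocore table has to satisfy.&lt;br /&gt;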
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are at [[Geocode.py]]. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect. This was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. The deaddate is the ipo issuedate if there is one, otherwise the mas announceddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
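As a quick sanity check (a sketch, not part of the original build), the query below counts how many companies got their deaddate from each branch of the CASE expression above. It uses only the deaddate1 columns created above; the output column names viaipo, viama and viafiveyear are made up for illustration.&lt;br /&gt;
 --hypothetical check: one count per CASE branch&lt;br /&gt;
 SELECT&lt;br /&gt;
  SUM(CASE WHEN issuedate IS NOT NULL THEN 1 ELSE 0 END) AS viaipo,&lt;br /&gt;
  SUM(CASE WHEN issuedate IS NULL AND announceddate IS NOT NULL THEN 1 ELSE 0 END) AS viama,&lt;br /&gt;
  SUM(CASE WHEN issuedate IS NULL AND announceddate IS NULL THEN 1 ELSE 0 END) AS viafiveyear&lt;br /&gt;
 FROM deaddate1;&lt;br /&gt;
The three counts should add up to the row count of deaddate1.&lt;br /&gt;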
Next, run the queries below to build a master table with dead and alive flags, and then the count of companies alive in each city for every year in the database.&lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
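The tempbase query above reads from a table called allyears, which is not created in this section. If it does not already exist in your database, one possible sketch is to generate one row per year with generate_series and run it before building tempbase. The 1980 to 2017 range is an assumption; adjust it to cover the datefirstinv and deadyear values in your data.&lt;br /&gt;
 --hypothetical helper table, year range is an assumption&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;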
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company and geo details, with the ipo/ma outcome expressed as aliveyear and deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two output tables below, which aggregate the rounds by city, statecode and year.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks financing for activities we are not interested in, which should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
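A quick check, offered as a sketch: the query below lists any stage3 values that none of growthflag, transactionflag or excludeflag covers, so new or misspelled stage names in the SDC data are easy to spot. It relies only on the stageflags columns defined above; the column name numrounds is made up for illustration.&lt;br /&gt;
 --hypothetical check: stage3 values not covered by any flag&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;
An empty result means every round falls into one of the three groups.&lt;br /&gt;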
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses are available. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box, then add back the points that are located in Hawaii and Puerto Rico. Finally, import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
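The box test described at the start of this section was done outside the database, but it can also be sketched in SQL. The query below flags rows of geoimport that fall outside the longitude (-125 to -66) and latitude (24 to 50) limits given above. Treating statecode values 'HI' and 'PR' as the Hawaii and Puerto Rico exceptions is an assumption, and the table name geooutsidebox and column name outsideboxflag are made up for illustration.&lt;br /&gt;
 --hypothetical sketch of the bounding-box flag&lt;br /&gt;
 CREATE TABLE geooutsidebox AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN statecode IN ('HI', 'PR') THEN 0::int&lt;br /&gt;
  WHEN longitude NOT BETWEEN -125 AND -66 OR latitude NOT BETWEEN 24 AND 50 THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS outsideboxflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Rows with NULL coordinates are left unflagged by this sketch, so check those separately.&lt;br /&gt;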
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
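Instead we chose to use only firmname as the key because there were not many duplicates. For each duplicated firmname we keep the row with the lesser foundingdate: firmbaseinclude (despite its name) collects the greater foundingdate for each duplicated firmname, and the final query drops those rows.&lt;br /&gt;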
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The counts match, so you can see that fundname, firstinvdate is a good key. But we're going to use just the fundname as the key because it will make join operations easier later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
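The plan is to grab all the rows with a duplicated fundname and, for each name, include in the fundbasecore table only the row that has both the MIN(closedate) and the MIN(lastinvdate). A name drops out entirely if no single row matches both minimums, for example when one of the dates is NULL.&lt;br /&gt;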
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
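To confirm that fundname now works as a key, the same count comparison used for firmbasecore can be applied here (a sketch using only the table built above). The two counts should match if no duplicate fundnames remain.&lt;br /&gt;
 --hypothetical check: total rows versus distinct fundnames&lt;br /&gt;
 SELECT COUNT(*) FROM fundbasecore;&lt;br /&gt;
 SELECT COUNT(DISTINCT fundname) FROM fundbasecore;&lt;br /&gt;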
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the firm keys and the fund keys, and include the firmname from the fundbasecore table with the fund keys. Run these two files through the Matcher, then manually flag the multiple matches (there are only ~50 of them) and reimport the result to vcdb2.&lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
 DROP TABLE firmfundstestjoin;&lt;br /&gt;
 CREATE TABLE firmfundstestjoin AS&lt;br /&gt;
 SELECT f.firmname AS firmfirmname, fu.firmname AS fundsfirmname &lt;br /&gt;
 FROM firmbasecore AS f &lt;br /&gt;
 INNER JOIN fundbasecore AS fu ON f.firmname = fu.firmname WHERE fu.firmname != 'Undisclosed Firm';&lt;br /&gt;
If you do the full join you will notice that there are 30 firms in the funds table that do not exist in the firms table.&lt;br /&gt;
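One way to list those firms is an anti-join from fundbasecore to firmbasecore. This is a sketch that assumes the firmname columns used in the join above; the table name fundfirmsmissing is made up for illustration.&lt;br /&gt;
 --hypothetical sketch: firmnames in the funds table with no match in the firms table&lt;br /&gt;
 CREATE TABLE fundfirmsmissing AS&lt;br /&gt;
 SELECT DISTINCT fu.firmname&lt;br /&gt;
 FROM fundbasecore AS fu&lt;br /&gt;
 LEFT JOIN firmbasecore AS f ON f.firmname = fu.firmname&lt;br /&gt;
 WHERE f.firmname IS NULL AND fu.firmname != 'Undisclosed Firm';&lt;br /&gt;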
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19675</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19675"/>
		<updated>2017-08-02T16:25:05Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Name based matching firms to funds */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the line count given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
The companybase table also contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in Textpad for something unique about one of the entries for New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Once that is done, we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
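Before running a manual DELETE like this one, it can help to preview the candidate rows by running the key part of the WHERE clause as a SELECT. This is a sketch, not part of the original build; roundbase can hold several rows per company, so more than one row may come back for each url.&lt;br /&gt;
 --hypothetical preview: shows both url variants for the duplicated key&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, url&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;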
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
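To see what actually differs between the rows that share a key, it can help to pull the full ipos records for every duplicated key. The query below is a minimal sketch that assumes the ipoduplicates table built above. Note that a plain = join will not match keys where a key column such as statecode is NULL.&lt;br /&gt;
 --full ipos rows for each duplicated key, sorted so rows sharing a key sit together&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoduplicates AS d ON ipos.issuer = d.issuer AND ipos.issuedate = d.issuedate AND &lt;br /&gt;
 ipos.statecode = d.statecode&lt;br /&gt;
 ORDER BY ipos.issuer, ipos.issuedate, ipos.statecode;&lt;br /&gt;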
In the file you will notice that many duplicated keys have different principalamt values. For each key, let's keep the MAX principalamt and throw out the rows with the same key that have a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
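Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table, which prevents these keys from appearing in the ipocore table when you JOIN the ipos table with the ipoinclude table. Each DELETE matches on issuer and statecode only, so it removes every issuedate for that issuer, which is why a few of them report 2 rows.&lt;br /&gt;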
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
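Rather than typing each DELETE by hand, the statements can be generated from ipoduplicates2. This is a sketch of one way to do it, not necessarily how the statements above were produced. It relies on the PostgreSQL format() function, where %L quotes each value as a SQL literal.&lt;br /&gt;
 --one DELETE per issuer and statecode pair that still has duplicate keys&lt;br /&gt;
 SELECT DISTINCT format('DELETE FROM ipoinclude WHERE issuer = %L AND statecode = %L;', issuer, statecode)&lt;br /&gt;
 FROM ipoduplicates2;&lt;br /&gt;
Paste the generated statements back into psql to run them.&lt;br /&gt;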
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
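The manual deletes above drop every record for the affected issuer and statecode pairs. An alternative that keeps one record per key is PostgreSQL's DISTINCT ON. This is a different technique from the one used above and only a sketch: when two rows tie on principalamt, the row that is kept is arbitrary unless more columns are added to the ORDER BY, and the table name ipocorealt is hypothetical.&lt;br /&gt;
 --keep exactly one row per key, preferring the largest principalamt&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;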
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column to mas and took the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 --read from mas rather than the mas1 copy so these ids are the same ids joined on below&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count should match the total row count of the mascore table (114825), which means the mascore table is clean.&lt;br /&gt;
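To review which rows were discarded, an anti-join of mas against masinclude lists them. This is a sketch that assumes the id values in masinclude refer to rows in mas.&lt;br /&gt;
 --rows in mas whose id was not chosen as the MIN id for its key&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas LEFT JOIN masinclude ON mas.id = masinclude.id&lt;br /&gt;
 WHERE masinclude.id IS NULL&lt;br /&gt;
 ORDER BY mas.targetname, mas.announceddate;&lt;br /&gt;
With the counts above this should list 114890 - 114825 = 65 rows.&lt;br /&gt;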
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week after the earliest announceddate for the same targetname, targetstatecode pair. Keep the key that has the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum announceddate by 1 week or less, the dateflag is 0.&lt;br /&gt;
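The boundary cases are easy to get wrong, so here is the same CASE expression run on four made-up dates (illustrative values, not taken from mas), all with a minimum announceddate of 2010-01-01.&lt;br /&gt;
 SELECT d.announceddate, d.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; d.minanndate OR d.announceddate = d.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-01', DATE '2010-01-01'), --the minimum itself: dateflag 1&lt;br /&gt;
  (DATE '2010-01-05', DATE '2010-01-01'), --4 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-08', DATE '2010-01-01'), --exactly 7 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-09', DATE '2010-01-01')  --8 days later: dateflag 1&lt;br /&gt;
 ) AS d(announceddate, minanndate);&lt;br /&gt;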
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag that is 1 if the announceddate &amp;lt; datefirstinv and a second exclude flag that is 1 if the datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
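The three flags can also be computed in SQL instead of Excel. The sketch below assumes the raw matcher output was first loaded into a table named matcherportcomasraw holding only the first seven columns. That table name is hypothetical and is not part of the process above.&lt;br /&gt;
 --rebuild the Excel flags from the dates and the warning column&lt;br /&gt;
 SELECT warning, file1coname, file1statecode, file1datefirstinv, file2targetname, file2targetstatecode, file2announceddate,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;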
You've imported 9,645 matches into your matcher table in vcdb2. The query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the datefirstinv &amp;lt; announceddate for the merger/acquisition, so the datefirstinv is neither after nor equal to the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should equal the number of rows in the matcherwarningmindates table (364) but it doesn't (366). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
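If a portco's announceddate and ipoissuedate fall on the same day, both flags above come out as 1 and that portco would land in both core tables. A quick check, sketched here against the flag table just built:&lt;br /&gt;
 --portcos whose merger/acquisition and ipo share the same masterdate&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;
Any rows counted here need a tie-break rule before the master table is built.&lt;br /&gt;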
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined row counts of the matcherportcoipocore and matcherportcomascore tables. The 83 portcokeys that matched both an ipokey and a maskey each kept only one of the two, so they are added back. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or a merger/acquisition. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
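One more check worth running is that the master table is still 1:1 on the portco key. This sketch uses the same distinct-key pattern as the earlier sections. The result should equal the 44740 rows in companybasecore.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;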
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add one exclude flag that is 1 where the issuedate &amp;lt; datefirstinv and a second exclude flag that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, then you want to exclude this row from your table. If both flags are 0, then you want to keep the row, i.e. the issuedate &amp;gt; datefirstinv. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows vs 43651 keys), spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
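To see the full rows behind the remaining duplicate keys, a window count can be used. This is a sketch, and keycount is just an illustrative column name.&lt;br /&gt;
 --every geo1 row whose key appears more than once, with all columns for comparison&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT *, COUNT(*) OVER (PARTITION BY city, coname, startyear) AS keycount FROM geo1)a&lt;br /&gt;
 WHERE keycount &amp;gt; 1&lt;br /&gt;
 ORDER BY coname, city, startyear;&lt;br /&gt;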
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
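The manual check is easier with the conflicting rows side by side. A minimal sketch, meant to be run before the DELETE statements and assuming only the geo1 columns already used above (the aliases g and d are just for illustration):&lt;br /&gt;
 --list every geo1 row whose (city, coname, startyear) key occurs more than once&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 INNER JOIN (&lt;br /&gt;
  SELECT city, coname, startyear&lt;br /&gt;
  FROM geo1&lt;br /&gt;
  GROUP BY city, coname, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ) AS d ON d.city = g.city AND d.coname = g.coname AND d.startyear = g.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;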
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geo keys and companybase keys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors, as usual. If you simply import the matcher output into vcdb2 and join companybasecore with geocore through it, the row count inflates because some keys match more than one row. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
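The 278 extra rows (45,018 minus 44,740) come from keys that match more than once. A sketch of one way to locate them, assuming only the matcherportcogeo columns defined above (nummatches is a made-up column name):&lt;br /&gt;
 --portco keys that the matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;
Extra rows can also appear if (coname, city, year) is not unique within companybasecore itself, so an empty result here does not rule out duplicates on the portco side.&lt;br /&gt;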
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data and traced back to a dirty file, GeocodedVCData, that we initially used for geo coordinates. In the future, the safest way to get geo coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
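A sketch of the address export, assuming companybasecore carries the addr1, addr2, city, statecode and zip columns used elsewhere on this page. The table and file name addressesforgeocode is made up for illustration, and the input format that Geocode.py expects is not covered here:&lt;br /&gt;
 --carry the full portco key (coname, statecode, datefirstinv) so the results can be joined back&lt;br /&gt;
 CREATE TABLE addressesforgeocode AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 WHERE addr1 IS NOT NULL;&lt;br /&gt;
 \COPY addressesforgeocode TO 'addressesforgeocode.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;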
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: the IPO issuedate if there is one, otherwise the merger/acquisition announceddate, and if there is no exit at all, datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
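The CASE expression above picks the first available date in the order issuedate, announceddate, datelastinv + 5 years. The same rule can be written with COALESCE. This is an equivalent sketch for comparison, not the query that was run, and deaddatealt is a made-up column name:&lt;br /&gt;
 --COALESCE returns its first non-NULL argument&lt;br /&gt;
 SELECT *, COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR') AS deaddatealt&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;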
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
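The tempbase query joins against allyears, which is not created on this page. A minimal sketch, assuming allyears is simply one integer column named year that covers every year in which a selected company is alive. The range is taken from deadalive1 rather than hard-coded, and the statement should only be run if the table does not already exist:&lt;br /&gt;
 --one row per calendar year between the earliest aliveyear and the latest deadyear&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(&lt;br /&gt;
  (SELECT MIN(aliveyear)::int FROM deadalive1),&lt;br /&gt;
  (SELECT MAX(deadyear)::int FROM deadalive1)&lt;br /&gt;
 ) AS year;&lt;br /&gt;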
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company and geo details, with the ipo/ma exit information expressed as aliveyear and deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
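Once roundcore is built, the key can be verified the same way as for the other core tables. A sketch using the counting idiom from earlier sections (numrows and numkeys are made-up column names):&lt;br /&gt;
 --the two counts should be equal if (coname, rounddate) is a valid key&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM roundcore) AS numrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a) AS numkeys;&lt;br /&gt;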
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that they can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
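Because growthflag, transactionflag and excludeflag are built from separate lists of stage3 values, a stage3 value that is missing from all three lists (or is NULL) gets 0 everywhere. A sketch of a check for that, using only the columns created above (numrounds is a made-up column name):&lt;br /&gt;
 --stage3 values that do not fall into exactly one of the three groups&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag + transactionflag + excludeflag != 1&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;
An empty result means every round is classified exactly once.&lt;br /&gt;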
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -125 to -66 and latitude boundaries of 24 to 50.&lt;br /&gt;
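The box test can also be written directly in SQL. This is a sketch under assumptions: it reuses the latitude and longitude columns of geoimport and borrows the flag names longdirtyflag and latdirtyflag from the geodirtydatawithflags table further down. It leaves out the Hawaii and Puerto Rico exceptions, whose boundaries are not given on this page. The flags on this page were imported from a file, so this is not the method that was actually used:&lt;br /&gt;
 --flag coordinates outside longitude -125 to -66 or latitude 24 to 50&lt;br /&gt;
 --rows with NULL coordinates get 0 for both flags&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude NOT BETWEEN -125 AND -66 THEN 1::int ELSE 0::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude NOT BETWEEN 24 AND 50 THEN 1::int ELSE 0::int END AS latdirtyflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;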
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
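Instead we chose to use only firmname as the key because there were not too many duplicates. We remove the duplicates by keeping the row with the lesser foundingdate: despite its name, firmbaseinclude holds the MAX(foundingdate) row for each duplicated firmname, and those rows are dropped when firmbasecore is rebuilt.&lt;br /&gt;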
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
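A similar result, one row per firmname keeping the earliest foundingdate, can be sketched in a single query with PostgreSQL's DISTINCT ON. This is an alternative for comparison, not the query used above, and firmbasecorealt is a made-up table name:&lt;br /&gt;
 --for each firmname keep the first row in foundingdate order (NULL dates sort last)&lt;br /&gt;
 CREATE TABLE firmbasecorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (firmname) *&lt;br /&gt;
 FROM firmbase1&lt;br /&gt;
 WHERE undisclosedflag = 0&lt;br /&gt;
 ORDER BY firmname, foundingdate;&lt;br /&gt;
The two approaches agree when every duplicated firmname has exactly two rows with different, non-NULL foundingdates. They can differ when a name occurs three or more times, when both rows share a foundingdate, or when a foundingdate is NULL.&lt;br /&gt;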
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that (fundname, firstinvdate) is a good key. However, we're going to use fundname alone as the key because it will make join operations easier later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and, for each one, keep in the fundbasecore table only the row that has both the MIN(closedate) and the MIN(lastinvdate).&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
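fundinclude holds 47 fund names but fundinclude2 has only 44 rows, so some duplicated names appear to have no single row that carries both the minimum closedate and the minimum lastinvdate (a NULL in either date would also prevent a match). A sketch for listing them, using only the tables built above:&lt;br /&gt;
 --duplicated fundnames with no surviving row in fundinclude2&lt;br /&gt;
 SELECT fi.fundname, fi.closedate, fi.lastinvdate&lt;br /&gt;
 FROM fundinclude AS fi&lt;br /&gt;
 LEFT JOIN fundinclude2 AS f2 ON f2.fundname = fi.fundname&lt;br /&gt;
 WHERE f2.fundname IS NULL;&lt;br /&gt;
Any fund listed by this query is absent from fundbasecore entirely, because fundbase2 excludes every duplicated name and fundinclude2 adds back only the matched rows.&lt;br /&gt;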
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and the fund keys, and include the firmname from the fundbasecore table with the fund keys. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
==Joining firms with funds==&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19666</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19666"/>
		<updated>2017-08-01T21:21:56Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning firmbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Look up addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, statecode, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, that is, so that there is a 1:1 relationship between keys and entries. Given an entry you will be able to create a unique key, and given a (coname, statecode, datefirstinv) key you will be able to find exactly 1 entry in the companybase table that the key corresponds to.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look at it in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Delete this record from the roundbase table (the source that companybasecore is built from) using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
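As an alternative to copying the table out to a text editor, a sketch like the one below would list every column of the rows that share a duplicated key, so the field that differs can be seen directly in psql. It assumes the companybase1 table and flags built above and has not been run against vcdb2. Note that the join will not match rows where statecode or datefirstinv is NULL.&lt;br /&gt;
 SELECT c.*&lt;br /&gt;
 FROM companybase1 AS c&lt;br /&gt;
 INNER JOIN (&lt;br /&gt;
  SELECT coname, statecode, datefirstinv&lt;br /&gt;
  FROM companybase1&lt;br /&gt;
  WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
  GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ) AS dup ON c.coname = dup.coname AND c.statecode = dup.statecode AND c.datefirstinv = dup.datefirstinv&lt;br /&gt;
 WHERE c.alwaysusflag = 1 AND c.undisclosedflag = 0&lt;br /&gt;
 ORDER BY c.coname, c.statecode, c.datefirstinv;&lt;br /&gt;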
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
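The two-query key check used throughout this page can probably be collapsed into one query, since PostgreSQL accepts a row value inside COUNT(DISTINCT ...). This is only a sketch of that shortcut; the column aliases totalrows and distinctkeys are made up for illustration.&lt;br /&gt;
 --the key is unique when both columns return the same number&lt;br /&gt;
 SELECT COUNT(*) AS totalrows,&lt;br /&gt;
  COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
With the counts reported above, both columns would be expected to read 44740.&lt;br /&gt;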
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate them; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries (1,888 rows in the ipos table) we look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicated keys have different principalamt values. Let's keep the row with the MAX principalamt for each key and throw out the rows for the same key that have a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates: keys whose duplicate rows share the same maximum principalamt. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
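The approach above removes the issuers whose duplicate keys could not be separated by principalamt. If keeping one row per key is acceptable instead, PostgreSQL's DISTINCT ON could do the deduplication in a single step. This is a sketch of an alternative, not the method used to build vcdb2; the table name ipocorealt is hypothetical, and the row kept among ties on principalamt is arbitrary unless more columns are added to the ORDER BY.&lt;br /&gt;
 --keeps the first row per key after sorting by principalamt, largest first&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;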
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas --same table that mascore joins on below, so the ids line up&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the row count of the mascore table (114825), so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 7 days after the minimum announceddate for its targetname, targetstatecode pair, or when it is itself the minimum announceddate for that pair. If the announceddate is later than the minimum by 7 days or fewer, the dateflag is 0.&lt;br /&gt;
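To make the rule concrete, here is a toy query that applies the same CASE expression to made-up dates (not rows from vcdb2), assuming a minimum announceddate of 2010-01-01. The column names mirror those above; everything else is illustrative.&lt;br /&gt;
 SELECT d AS announceddate, DATE '2010-01-01' AS minanndate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d)&lt;br /&gt;
 ORDER BY d;&lt;br /&gt;
 --dateflag: 1 (the minimum itself), 0 (4 days later), 0 (exactly 7 days later), 1 (8 days later)&lt;br /&gt;
Only the rows flagged 1 would survive a WHERE dateflag = 1 filter.&lt;br /&gt;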
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add an exclude flag for rows where the announceddate &amp;lt; datefirstinv and a second exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
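Because the flags are added by hand in Excel, it may be worth cross-checking them once the file is imported. The sketch below recomputes each flag from the dates and the warning column and counts the rows where the imported value disagrees. It assumes that excludeflag1 marks announceddate before datefirstinv, that excludeflag2 marks equal dates, and that the warning text is stored exactly as &amp;quot;Hall-Warning:Multiple&amp;quot;; rows with a NULL flag are not counted.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 != (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
  OR excludeflag2 != (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
  OR warningflag != (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;
 --should return 0 if the Excel flags follow the assumed rules&lt;br /&gt;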
You've imported 9,645 matches into your matcher table in vcdb2 but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate of the merger/acquisition is after the datefirstinv (neither before it nor equal to it).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364) but it returns 366, because some portco keys match more than one row on the minimum announceddate. To find these dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables that contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
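In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression for masterdate above could probably be reduced to a single LEAST(announceddate, ipoissuedate) call. The toy query below uses made-up dates rather than vcdb2 data to illustrate that behaviour; the aliases a and i are placeholders.&lt;br /&gt;
 SELECT a AS announceddate, i AS ipoissuedate, LEAST(a, i) AS masterdate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2005-03-01', DATE '2004-06-15'),&lt;br /&gt;
  (DATE '2005-03-01', NULL::date),&lt;br /&gt;
  (NULL::date, DATE '2004-06-15'),&lt;br /&gt;
  (NULL::date, NULL::date)&lt;br /&gt;
 ) AS t(a, i);&lt;br /&gt;
 --masterdate is the earlier date when both exist, the only date when one is NULL, and NULL when both are NULL&lt;br /&gt;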
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above and the count from the query below, and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
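The same comparison can be sketched as a single query. It assumes the 8655 and 2350 in the sum above are the row counts of matcherportcomascore and matcherportcoipocore; the aliases validplusboth and matchertotal are made up for illustration.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS validplusboth,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;
 --with the counts reported on this page both columns would be 11005&lt;br /&gt;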
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
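One further check that could be run here, sketched in the same style as the duplicate-key queries earlier on this page, is to confirm that no portco key appears more than once in the master table. It uses only the columns created above.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --should return 0 rows if none of the joins fanned out&lt;br /&gt;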
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude the entry from your table. If both flags are 0 (i.e. the issuedate &amp;gt; datefirstinv) then you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
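The key check used above (distinct key count equals row count) comes up repeatedly in this rebuild. Here is a minimal, self-contained sketch of it, assuming Python's built-in sqlite3 module in place of psql and a made-up three-row table. The helper name key_is_valid and the demo table are hypothetical and not part of vcdb2.&lt;br /&gt;

```python
import sqlite3


def key_is_valid(conn, table, key_columns):
    """Return True when every row of `table` has a distinct key.

    Mirrors the check in the text: COUNT(*) over the DISTINCT key
    columns must equal COUNT(*) over the whole table.
    `table` and `key_columns` are trusted identifiers (illustration only).
    """
    cols = ", ".join(key_columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table}) AS a"
    ).fetchone()[0]
    return total == distinct


# Toy data standing in for matcherportcoipocore (values are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (file1coname TEXT, file1statecode TEXT, file1datefirstinv TEXT)"
)
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("Acme Inc", "TX", "2001-01-15"),
        ("Acme Inc", "CA", "2001-01-15"),
        ("Beta LLC", "NY", "2005-06-30"),
    ],
)
KEY = ["file1coname", "file1statecode", "file1datefirstinv"]
print(key_is_valid(conn, "demo", KEY))
```

If the helper returns False, list the offending keys with a GROUP BY ... HAVING COUNT(*) query like the ones used in the geo cleaning section.&lt;br /&gt;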
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 extra rows, spread over 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through the [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv from the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates in our data are incorrect. This was found while analyzing the output data and traced back to a dirty file, GeocodedVCData, that we initially used for geo coordinates. In the future the safest way to get geo-coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
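The CASE expression above can be restated outside the database. The sketch below is an illustration only, assuming Python's standard datetime module; compute_deaddate and add_years are hypothetical helper names. Like the query, it prefers the IPO issuedate, then the mascore announceddate, and otherwise falls back to datelastinv plus five years.&lt;br /&gt;

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, clamping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only Feb 29 can fail here; fall back to Feb 28 of the target year.
        return d.replace(year=d.year + years, day=28)


def compute_deaddate(datelastinv, issuedate=None, announceddate=None):
    """Same priority as the CASE: IPO date, then M and A date, then +5 years."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    if datelastinv is None:
        # No exit and no last investment date: nothing to compute (NULL in SQL).
        return None
    return add_years(datelastinv, 5)


print(compute_deaddate(date(2010, 3, 1)))
print(compute_deaddate(date(2010, 3, 1), issuedate=date(2012, 7, 4)))
```

Note that when a company has both an IPO and an acquisition on record, the IPO date wins because it is tested first.&lt;br /&gt;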
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
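The tempbase and alivecount queries expand each company into one row per year from its first investment year through its deadyear (inclusive), then count companies per city, state and year. A rough sketch of that logic follows, assuming plain Python and invented sample companies; count_alive is a hypothetical helper, not something in vcdb2.&lt;br /&gt;

```python
from collections import Counter


def count_alive(companies):
    """Count companies alive per (city, statecode, year).

    `companies` is an iterable of (coname, city, statecode, aliveyear, deadyear).
    A company counts as alive in every year from aliveyear through deadyear,
    matching the inclusive join condition used for tempbase.
    """
    seen = set()
    for coname, city, statecode, aliveyear, deadyear in companies:
        for year in range(aliveyear, deadyear + 1):
            # The set mirrors SELECT DISTINCT year, coname, city, statecode.
            seen.add((year, coname, city, statecode))
    counts = Counter()
    for year, _coname, city, statecode in seen:
        counts[(city, statecode, year)] += 1
    return counts


# Invented sample rows, not taken from vcdb2.
sample = [
    ("Acme Inc", "Houston", "TX", 2001, 2003),
    ("Beta LLC", "Houston", "TX", 2002, 2002),
]
alive = count_alive(sample)
print(alive[("Houston", "TX", 2002)])
```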
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that those entries can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
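As a hedged illustration of the box test, the sketch below uses Python's built-in sqlite3 module and invented coordinates. The continental box uses the boundaries stated above. The Hawaii and Puerto Rico boxes are rough assumptions of mine and are not taken from the original flag file, so adjust them before relying on the result. The geodemo table and the dirtyflag column are hypothetical names.&lt;br /&gt;

```python
import sqlite3

# Continental US box from the text: longitude -125 to -66, latitude 24 to 50.
# The Hawaii and Puerto Rico boxes are approximate assumptions for illustration.
FLAG_QUERY = """
SELECT coname,
       CASE
         WHEN latitude BETWEEN 24 AND 50
              AND longitude BETWEEN -125 AND -66 THEN 0   -- continental US
         WHEN latitude BETWEEN 18.5 AND 22.5
              AND longitude BETWEEN -161 AND -154 THEN 0  -- Hawaii (assumed box)
         WHEN latitude BETWEEN 17.5 AND 18.7
              AND longitude BETWEEN -67.5 AND -65 THEN 0  -- Puerto Rico (assumed box)
         ELSE 1
       END AS dirtyflag
FROM geodemo
ORDER BY coname
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geodemo (coname TEXT, latitude REAL, longitude REAL)")
conn.executemany(
    "INSERT INTO geodemo VALUES (?, ?, ?)",
    [
        ("Houston Co", 29.76, -95.37),    # inside the continental box
        ("Honolulu Co", 21.31, -157.86),  # Hawaii, added back
        ("San Juan Co", 18.47, -66.11),   # Puerto Rico, added back
        ("Dirty Co", 44.93, 7.54),        # Europe, flagged as dirty
    ],
)
flags = dict(conn.execute(FLAG_QUERY).fetchall())
print(flags)
```

Rows with dirtyflag = 1 are the ones to send back through Geocode.py. Alaska is not covered by any of these boxes, so such rows would need a manual check.&lt;br /&gt;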
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
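Instead we chose to use only firmname as the key because there were not too many duplicates. For each duplicated firmname we keep the row with the lesser foundingdate: the queries below find the greater foundingdate per duplicated name (stored in firmbaseinclude, despite its name) and drop those rows from the core table.&lt;br /&gt;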
 DROP TABLE firmbaseduplicates;&lt;br /&gt;
 CREATE TABLE firmbaseduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT firmname FROM firmbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY firmname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbaseinclude;&lt;br /&gt;
 CREATE TABLE firmbaseinclude AS&lt;br /&gt;
 SELECT f.firmname, MAX(f.foundingdate) AS foundingdate&lt;br /&gt;
 FROM firmbase1 AS f&lt;br /&gt;
 INNER JOIN firmbaseduplicates AS d ON f.firmname = d.firmname&lt;br /&gt;
 GROUP BY f.firmname;&lt;br /&gt;
 --12&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmbasecore;&lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT l.* &lt;br /&gt;
 FROM firmbase1 AS l &lt;br /&gt;
 LEFT JOIN firmbaseinclude AS r ON r.firmname = l.firmname AND r.foundingdate = l.foundingdate&lt;br /&gt;
 WHERE r.firmname IS NULL AND undisclosedflag = 0;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(DISTINCT firmname) FROM firmbasecore;&lt;br /&gt;
 --14133&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundcore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firms and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19663</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19663"/>
		<updated>2017-08-01T20:28:49Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the number of lines given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once it has a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off the roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is clean and good to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
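The same check can be run as a single query. This is a minimal sketch, assuming PostgreSQL's support for COUNT(DISTINCT ...) over a row value; the column aliases totalrows and distinctkeys are made up for illustration. The key is unique only when the two numbers are equal.&lt;br /&gt;
 --sketch: compare total rows against distinct keys in one pass&lt;br /&gt;
 SELECT COUNT(*) AS totalrows,&lt;br /&gt;
  COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybase;&lt;br /&gt;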
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
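As an alternative to exporting the table, the rows that share the duplicated key can also be inspected directly in psql. A minimal sketch using the key found above:&lt;br /&gt;
 --sketch: list every companybase1 row that shares the duplicated key&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;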
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the two flags on nationcode and undisclosed company that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between a (coname, statecode, datefirstinv) key and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on.&lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
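If you would rather keep one row for every key instead of deleting keys that are still tied, PostgreSQL's DISTINCT ON can pick a single row per key. This is a different technique from the one used above, shown only as a sketch: the table name ipocorealt is hypothetical, and when two rows tie on principalamt the row that is kept is arbitrary. The resulting row count would equal the number of distinct keys in ipos rather than 9470.&lt;br /&gt;
 --sketch: keep the row with the largest principalamt for each key&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;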
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
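 SELECT targetname, targetstatecode, announceddate, MIN(id) AS id&lt;br /&gt;
 FROM mas --take ids from mas itself, since mascore joins on mas.id and the serial ids in mas1 are assigned separately&lt;br /&gt;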
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore should match the total row count of the table (114825), which means the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 7 days after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is 7 days or fewer after the minimum announceddate for the targetname, targetstatecode pair (and is not the minimum itself), then the dateflag is 0.&lt;br /&gt;
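To see how the flag behaves, the same CASE expression can be run against a few made-up dates. This sketch assumes a hypothetical minimum announceddate of 2010-01-01; none of these dates come from the mas data. Note that a date exactly 7 days after the minimum still gets a dateflag of 0 because the comparison is strict.&lt;br /&gt;
 --sketch: 2010-01-01 and 2010-01-09 get dateflag 1; 2010-01-05 and 2010-01-08 get 0&lt;br /&gt;
 SELECT d.announceddate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d.announceddate = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS d(announceddate);&lt;br /&gt;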
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
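Because the flags were added by hand in Excel, it can be worth checking them against the dates after the import. This is a sketch that assumes excludeflag1 marks announceddate &amp;lt; datefirstinv and excludeflag2 marks datefirstinv = announceddate, in the order they are described above. If the flags were entered consistently it should return 0. Rows with an empty flag are not counted by this query.&lt;br /&gt;
 --sketch: count rows whose Excel flags disagree with the dates (assumed flag meanings)&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 &amp;lt;&amp;gt; (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 &amp;lt;&amp;gt; (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END);&lt;br /&gt;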
You've imported 9,645 matches into your matcher table in vcdb2, but only some of them are &amp;quot;good&amp;quot; matches, which the query below counts. These are matches that do not contain warnings, where the announceddate &amp;gt; datefirstinv for a merger/acquisition (the deal was announced after the first investment) and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs. 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
         file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the counts from the two queries above to the count from the query below, and compare the sum to the total number of keys in your matcher core tables (matcherportcomascore and matcherportcoipocore). In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
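A quick way to confirm that no join fanned out is to compare the row count with the number of distinct company keys. This is a minimal sketch rather than one of the checks run above; if the master table is clean, the result should equal the 44740 rows reported when the table was created.&lt;br /&gt;
 --sketch: distinct company keys in the master table; fewer than COUNT(*) means duplicated keys&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;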
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you will want to exclude the row from your table, because a valid match must have issuedate &amp;gt; datefirstinv. If both flags are 0 then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
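The Excel flags can be cross-checked inside the database once the file is imported. The query below is a minimal sketch and not part of the original process; the output column names issuedbeforeflag and issuedsamedayflag are made up for illustration and may not line up one-to-one with excludeflag1 and excludeflag2.&lt;br /&gt;
 --sketch: list rows that fail the date rule and recompute the two conditions (output column names are hypothetical)&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, file2issuedate,&lt;br /&gt;
 CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS issuedbeforeflag,&lt;br /&gt;
 CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS issuedsamedayflag&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE file2issuedate &amp;lt;= file1datefirstinv;&lt;br /&gt;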
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
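The check above covers the portco side of each match. As an optional extra, not part of the original process, the sketch below lists any ipo key that is matched to more than one portco key. Whether such rows are errors is a judgment call that needs a manual look.&lt;br /&gt;
 --sketch: ipo keys that appear in more than one row of the core match table&lt;br /&gt;
 SELECT file2issuer, file2statecode, file2issuedate, COUNT(*)&lt;br /&gt;
 FROM matcherportcoipocore&lt;br /&gt;
 GROUP BY file2issuer, file2statecode, file2issuedate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;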
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
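To see where the extra rows come from, you can list the portco keys that the Matcher paired with more than one geo key. This is a minimal sketch and not part of the original process. It assumes the blow-up comes from multiple matches in matcherportcogeo, although repeated (coname, city, year) combinations in companybasecore could also contribute, since that is not the companybasecore key.&lt;br /&gt;
 --sketch: portco keys with more than one row in the matcher output&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;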
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
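Note that the CASE branches are evaluated in order, so a company that has both an IPO and an acquisition gets the issuedate as its deaddate. The sketch below, which is not part of the original process, counts how many rows that ordering affects.&lt;br /&gt;
 --sketch: companies with both an ipo date and an acquisition date&lt;br /&gt;
 SELECT COUNT(*) FROM deaddate WHERE issuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;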
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
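The tempbase query above joins against an allyears table that is not created anywhere in this section. A minimal sketch of one way to build it is shown below, assuming it holds a single integer column named year. The 1980 to 2017 range is a placeholder and should be changed to cover the years in your data; skip this step if allyears already exists.&lt;br /&gt;
 --sketch: one row per calendar year (the range is an assumption, adjust as needed)&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT y AS year FROM generate_series(1980, 2017) AS y;&lt;br /&gt;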
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag is used to exclude all companies that received financing for activities we are not interested in. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout', 'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and look them up again using the Geocode.py script. To isolate them, place a bounding box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
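The box test can also be written directly in SQL. The query below is a minimal sketch that applies the boundaries quoted above to geoimport. The outsideboxflag column name is hypothetical, and the query does not reproduce the flag file that is imported further down.&lt;br /&gt;
 --sketch: flag coordinates outside the continental US box (outsideboxflag is a made-up name)&lt;br /&gt;
 --rows with NULL coordinates are also flagged because the BETWEEN test is not true for them&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude BETWEEN -125 AND -66 AND latitude BETWEEN 24 AND 50 THEN 0::int ELSE 1::int END AS outsideboxflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Points in Hawaii and Puerto Rico are flagged by this box and still need to be added back by hand. The import-based steps that were actually used follow.&lt;br /&gt;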
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key. But we're going to use simply the fundname as a key because it will be easier to do join operations later.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27050&lt;br /&gt;
The plan is to grab all the duplicate fundnames and only include the ones with the MIN(closedate) AND MIN(lastinvdate) in the fundbasecore table.&lt;br /&gt;
 DROP TABLE fundnameexclude;&lt;br /&gt;
 CREATE TABLE fundnameexclude AS&lt;br /&gt;
 SELECT fundname, COUNT(*) FROM (SELECT fundname FROM fundbase1 WHERE undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundexclude;&lt;br /&gt;
 CREATE TABLE fundexclude AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundnameexclude as e ON f.fundname = e.fundname;&lt;br /&gt;
 --94  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundbase2;&lt;br /&gt;
 CREATE TABLE fundbase2 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 EXCEPT&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundexclude;&lt;br /&gt;
 --27003&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude;&lt;br /&gt;
 CREATE TABLE fundinclude AS&lt;br /&gt;
 SELECT fundname, MIN(closedate) AS closedate, MIN(lastinvdate) AS lastinvdate &lt;br /&gt;
 FROM fundexclude&lt;br /&gt;
 GROUP BY fundname;&lt;br /&gt;
 --47&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE fundinclude2;&lt;br /&gt;
 CREATE TABLE fundinclude2 AS&lt;br /&gt;
 SELECT f.*&lt;br /&gt;
 FROM fundbase1 AS f&lt;br /&gt;
 INNER JOIN fundinclude AS fu ON f.fundname = fu.fundname AND f.closedate = fu.closedate AND f.lastinvdate = fu.lastinvdate;&lt;br /&gt;
 --44&lt;br /&gt;
&lt;br /&gt;
 --create fundbasecore table&lt;br /&gt;
 DROP TABLE fundbasecore;&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT * FROM fundbase2&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM fundinclude2;&lt;br /&gt;
 --27047&lt;br /&gt;
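As a check, assuming the rebuilt fundbasecore table above, the total row count and the distinct fundname count should now agree, which is what makes fundname alone usable as the key. A minimal sketch:&lt;br /&gt;
 --both counts should match if fundname is now unique&lt;br /&gt;
 --note that COUNT(DISTINCT fundname) ignores rows where fundname is NULL&lt;br /&gt;
 SELECT COUNT(*) AS totalrows, COUNT(DISTINCT fundname) AS distinctfundnames&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;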
&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and fund keys, and also include the firmname from the fundbasecore table. Run these two files through the Matcher, then manually flag the multiple matches (there are only ~50 of them) and reimport the result to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
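To find the multiple matches that need manual flagging, a query along these lines could list every fundname that the Matcher paired with more than one firm. This is only a sketch; it assumes a multiple match shows up as a repeated fundname in matcherfirmsfunds:&lt;br /&gt;
 SELECT fundname, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherfirmsfunds&lt;br /&gt;
 GROUP BY fundname&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC, fundname;&lt;br /&gt;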
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19646</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19646"/>
		<updated>2017-08-01T17:00:46Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Fixing erroneous geo-coordinates */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the row counts copied into your new tables match the counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
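One way to check the loads is to count the rows in each loaded table and compare against the counts reported when the load scripts ran. The sketch below does this for two of the tables used later on this page, and assumes the load scripts create tables named roundbase and ipos:&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) AS numrows FROM roundbase&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT 'ipos', COUNT(*) FROM ipos;&lt;br /&gt;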
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table, because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between keys and entries: given an entry you will be able to create a unique key, and given a (coname, statecode, datefirstinv) key you will be able to find exactly 1 entry in the companybase table that the key corresponds to.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so a key can match more than one entry in the table.&lt;br /&gt;
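The same comparison can be made in a single query. This sketch relies on PostgreSQL accepting a row value inside COUNT(DISTINCT ...):&lt;br /&gt;
 --totalrows and distinctkeys are equal only when the key is unique&lt;br /&gt;
 SELECT COUNT(*) AS totalrows,&lt;br /&gt;
  COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybase;&lt;br /&gt;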
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
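Instead of exporting to a text editor, the full duplicate rows can also be pulled up inside the database and compared column by column. A sketch, reusing the flags from companybase1:&lt;br /&gt;
 --list every column of the rows that share a duplicated key&lt;br /&gt;
 SELECT c.*&lt;br /&gt;
 FROM companybase1 AS c&lt;br /&gt;
 INNER JOIN (&lt;br /&gt;
  SELECT coname, statecode, datefirstinv&lt;br /&gt;
  FROM companybase1&lt;br /&gt;
  WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
  GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ) AS d ON c.coname = d.coname AND c.statecode = d.statecode AND c.datefirstinv = d.datefirstinv&lt;br /&gt;
 ORDER BY c.coname, c.url;&lt;br /&gt;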
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for another way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys, so that 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
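Alternatively, the same removal could be done in one statement rather than one DELETE per issuer. This sketch assumes ipoduplicates2 still holds the duplicated keys. Unlike the statements above it also matches on issuedate, so it removes only the exact duplicated key:&lt;br /&gt;
 --keys with a NULL statecode or issuedate will not match here and would still need a manual DELETE&lt;br /&gt;
 DELETE FROM ipoinclude AS i&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE i.issuer = d.issuer AND i.issuedate = d.issuedate AND i.statecode = d.statecode;&lt;br /&gt;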
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
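An alternative, shown only as a sketch and not the method used above, is PostgreSQL's DISTINCT ON, which keeps exactly one row per key. Note the difference: where two rows tie on the maximum principalamt, this keeps one of them arbitrarily, whereas the process above drops the key altogether. The table name ipocorealt is hypothetical.&lt;br /&gt;
 --keep the row with the largest principalamt for each (issuer, issuedate, statecode)&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;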
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 --read ids from mas, the table joined on id below; mas1 was numbered by a separate sequence&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. First, for each (targetname, targetstatecode) pair, remove mas keys whose announceddate falls within one week after the earliest announceddate: keep the key that has the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum by no more than 1 week, the dateflag is 0.    &lt;br /&gt;
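To see how the CASE expression behaves, here is a small self-contained sketch with made-up dates (they are illustrative only, not from the mas data):&lt;br /&gt;
 SELECT d.announceddate, m.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; m.minanndate OR d.announceddate = m.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-20')) AS d(announceddate)&lt;br /&gt;
 CROSS JOIN (SELECT DATE '2010-01-01' AS minanndate) AS m&lt;br /&gt;
 ORDER BY d.announceddate;&lt;br /&gt;
 --dateflag: 1 (the minimum itself), 0 (4 days later), 0 (exactly 7 days later), 1 (19 days later)&lt;br /&gt;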
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
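Because the flags are added by hand in Excel, it may be worth checking them once the file is back in the database. The sketch below counts rows whose flags disagree with the rules described above. It assumes the warning column holds the text Hall-Warning:Multiple exactly and that no flag is left blank. A result of 0 would mean the flags are consistent.&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 != (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 != (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag != (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;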
You've imported 9,645 matches into your matcher table in vcdb2. The query below gives the number of &amp;quot;good&amp;quot; matches: matches that do not contain warnings and where the datefirstinv &amp;lt; announceddate for a merger/acquisition (so the announceddate neither precedes nor equals the datefirstinv).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
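The query above only checks the portco side. To check that each mas key is also matched to a single portco, a similar count on the mas key columns could be run; if it returns fewer than the number of rows in matcherportcomascore, some mas key is matched to more than one portco. This is an additional check sketched here, not part of the original process:&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file2targetname, file2targetstatecode, file2announceddate&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;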
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below. The total should equal the combined number of matched keys in your portco-to-mas and portco-to-ipo matcher core tables, because each portco that matched both an ipo and a mas now keeps only one of them. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage, make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], each portco keeps only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
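One further check, sketched here on the assumption that (coname, statecode, datefirstinv) should remain a unique key in the master table: list any portco key that appears more than once after the joins. This query is an addition to the checks above.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --expect no rows if the joins did not duplicate any portco&lt;br /&gt;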
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, then you want to exclude that row from your table, because a valid match must have issuedate &amp;gt; datefirstinv. If both flags are 0, then you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
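If you would rather compute the flags in SQL than in Excel, the following is a minimal sketch. It assumes the unflagged Matcher output has first been loaded into a hypothetical table called matcherportcoiporaw with the first seven columns above, and it assumes excludeflag1 marks issuedate &amp;lt; datefirstinv while excludeflag2 marks issuedate = datefirstinv (the text does not say which flag is which).&lt;br /&gt;
 --matcherportcoiporaw is a hypothetical staging table, not part of the original build&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
  CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
  CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
  CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;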
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 8 keys that are not distinct, which together account for the 11 surplus rows (43662 - 43651). We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
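To see where the extra rows come from, it can help to list the portco keys that the Matcher paired with more than one geo key. This is a sketch of one likely cause only; the join key (coname, city, year) may also fail to be unique in companybasecore itself, which this query does not test.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;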
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. For instructions on how to do this, see [[Geocode.py]]. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates in our data are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future, the safest way to get geo-coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
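The tempbase query above reads from an allyears table that is not built in this section. A minimal sketch, assuming it is simply one row per calendar year in a column named year; the bounds 1980 and 2017 are placeholders and should be replaced with the range your data covers.&lt;br /&gt;
 --placeholder year range, adjust to your data&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017)::int AS year;&lt;br /&gt;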
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude financing for activities we are not interested in, which should therefore be left out of our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
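A quick sanity check, not part of the original build: list any stage3 values that none of the CASE expressions above classify. This assumes every round should end up with growthflag, transactionflag or excludeflag set; anything returned here (including a NULL stage3) needs a decision.&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;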
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo-coordinates in the db are dirty and point to locations in India and Eastern Europe. However, the company addresses exist. Isolate the dirty geo-coordinates and do a lookup using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
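The flag file imported above was presumably prepared outside the database. As a rough sketch, the two bounding-box flags could also be computed directly in SQL from geodirtydata using the boundaries given above. The Hawaii and Puerto Rico flags are left out because no boundaries for them are given here.&lt;br /&gt;
 SELECT g.*,&lt;br /&gt;
  CASE WHEN longitude BETWEEN -125 AND -66 THEN 0::int ELSE 1::int END AS longdirtyflag,&lt;br /&gt;
  CASE WHEN latitude BETWEEN 24 AND 50 THEN 0::int ELSE 1::int END AS latdirtyflag&lt;br /&gt;
 FROM geodirtydata AS g;&lt;br /&gt;
 --NULL coordinates fall through to ELSE and are flagged as dirty&lt;br /&gt;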
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
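Before reusing geoimportfix, it may be worth confirming that (coname, statecode, datefirstinv) is still a unique key, as it was for geoimport. The NOT IN filter above matches on coname alone, so this is not guaranteed. A sketch of that check, following the pattern used elsewhere on this page:&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimportfix)a;&lt;br /&gt;
 --should equal the geoimportfix row count (41718 above) if the key is unique&lt;br /&gt;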
Then redo coleveloutput and colevelsimple using the geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Export the fund keys (including the firmname recorded in the fundbasecore table) and the firm keys. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport the flagged file into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
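The multiple matches that need a manual flag can also be listed with a query before opening the file. The block below is only a sketch under stated assumptions: demo_matches is a hypothetical temporary table with made-up rows, carrying the same key columns as matcherfirmsfunds, and it is not part of ProcessData2.sql.&lt;br /&gt;

```sql
-- Hypothetical demo table; in vcdb2 the real table would be matcherfirmsfunds.
CREATE TEMP TABLE demo_matches (
 firmname varchar(100),
 firmstatecode varchar(2),
 firmfoundingdate date,
 fundname varchar(100),
 fundfirstinvdate date
);
INSERT INTO demo_matches VALUES
 ('Alpha Partners', 'CA', '1990-01-01', 'Alpha Fund I', '1991-06-01'),
 ('Alpha Partners', 'NY', '1995-01-01', 'Alpha Fund I', '1991-06-01'),
 ('Beta Capital', 'TX', '2000-01-01', 'Beta Fund II', '2003-03-15');

-- A fund key (fundname, fundfirstinvdate) should match exactly one firm.
-- Every key returned here is a multiple match that needs an exclude flag.
SELECT fundname, fundfirstinvdate, COUNT(*) AS nummatches
FROM demo_matches
GROUP BY fundname, fundfirstinvdate
HAVING COUNT(*) > 1;
```

To use it on the real data, replace demo_matches with matcherfirmsfunds and skip the CREATE and INSERT statements.&lt;br /&gt;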
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19645</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19645"/>
		<updated>2017-08-01T17:00:16Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning firmbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the count recorded in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off the roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
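As an alternative to copying the table out to textpad, the duplicated rows can be compared side by side inside psql. This is a minimal sketch, assuming a hypothetical temporary table demo_companybase with invented rows and only a few of the companybase1 columns.&lt;br /&gt;

```sql
-- Hypothetical demo table standing in for companybase1 (only a few columns).
CREATE TEMP TABLE demo_companybase (
 coname varchar(100),
 statecode varchar(2),
 datefirstinv date,
 url varchar(100)
);
INSERT INTO demo_companybase VALUES
 ('Example Health LLC', 'NY', '2015-08-13', 'www.example.com'),
 ('Example Health LLC', 'NY', '2015-08-13', 'www.example.org'),
 ('Other Co', 'CA', '2012-01-05', 'www.other.example');

-- Return every column of every row whose key occurs more than once,
-- so the field that differs (here url) can be spotted directly.
SELECT c.*
FROM demo_companybase AS c
INNER JOIN (
 SELECT coname, statecode, datefirstinv
 FROM demo_companybase
 GROUP BY coname, statecode, datefirstinv
 HAVING COUNT(*) > 1
) AS d ON c.coname = d.coname AND c.statecode = d.statecode AND c.datefirstinv = d.datefirstinv
ORDER BY c.coname, c.url;
```

The same join pattern should work against companybase1 once the flag filters (alwaysusflag = 1 AND undisclosedflag = 0) are added to both SELECTs.&lt;br /&gt;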
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate them; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in textpad/excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
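The number of duplicated keys and the number of rows they cover can be computed in one query instead of summing the count column by hand. A small sketch follows, assuming a hypothetical temporary table demo_ipos with invented rows; the numeric type chosen for principalamt is also an assumption.&lt;br /&gt;

```sql
-- Hypothetical demo table standing in for ipos.
CREATE TEMP TABLE demo_ipos (
 issuer varchar(100),
 issuedate date,
 statecode varchar(2),
 principalamt numeric
);
INSERT INTO demo_ipos VALUES
 ('Acme Corp', '2001-05-01', 'CA', 100),
 ('Acme Corp', '2001-05-01', 'CA', 250),
 ('Bolt Inc', '1999-11-12', 'NY', 75);

-- duplicatekeys: how many keys occur more than once.
-- duplicaterows: how many rows in total carry one of those keys.
SELECT COUNT(*) AS duplicatekeys, SUM(numrows) AS duplicaterows
FROM (
 SELECT issuer, issuedate, statecode, COUNT(*) AS numrows
 FROM demo_ipos
 GROUP BY issuer, issuedate, statecode
 HAVING COUNT(*) > 1
) AS d;
```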
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
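The hand-written DELETE statements could also be replaced by a single DELETE that reads the bad keys from the duplicates table. This is a sketch on hypothetical temporary tables (demo_ipoinclude and demo_ipoduplicates2, with invented rows), not the statement that was actually run. Unlike the statements above, it also matches on issuedate, so only the exact duplicated key is removed.&lt;br /&gt;

```sql
-- Hypothetical demo tables standing in for ipoinclude and ipoduplicates2.
CREATE TEMP TABLE demo_ipoinclude (
 issuer varchar(100),
 issuedate date,
 statecode varchar(2),
 principalamt numeric
);
CREATE TEMP TABLE demo_ipoduplicates2 (
 issuer varchar(100),
 issuedate date,
 statecode varchar(2),
 count bigint
);
INSERT INTO demo_ipoinclude VALUES
 ('Acme Corp', '2001-05-01', 'CA', 250),
 ('Bolt Inc', '1999-11-12', 'NY', 75);
INSERT INTO demo_ipoduplicates2 VALUES
 ('Acme Corp', '2001-05-01', 'CA', 2);

-- Delete every include row whose key is listed as a duplicate.
-- IS NOT DISTINCT FROM treats two NULL statecodes as equal.
DELETE FROM demo_ipoinclude AS i
USING demo_ipoduplicates2 AS d
WHERE i.issuer = d.issuer
 AND i.issuedate = d.issuedate
 AND i.statecode IS NOT DISTINCT FROM d.statecode;
```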
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
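 --join on mas1: masinclude.id was taken from mas1, and ids added separately to mas are not guaranteed to line up&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;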
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 7 days after the minimum announceddate for its targetname, targetstatecode pair, or when it is that minimum announceddate. If the announceddate is later than the minimum by 7 days or less, the dateflag is 0.    &lt;br /&gt;
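The boundary behaviour of the flag is easiest to see on a few sample dates. The query below is an illustrative sketch that applies the same CASE expression to invented values; it does not touch any vcdb2 table.&lt;br /&gt;

```sql
-- Four sample announceddates against a minimum of 2010-01-01:
-- the minimum itself, 4 days later, exactly 7 days later, and 8 days later.
SELECT announceddate, minanndate,
 CASE WHEN announceddate - INTERVAL '7 day' > minanndate
   OR announceddate = minanndate THEN 1::int
  ELSE 0::int
 END AS dateflag
FROM (VALUES
 (DATE '2010-01-01', DATE '2010-01-01'),
 (DATE '2010-01-05', DATE '2010-01-01'),
 (DATE '2010-01-08', DATE '2010-01-01'),
 (DATE '2010-01-09', DATE '2010-01-01')
) AS t(announceddate, minanndate)
ORDER BY announceddate;
-- dateflag comes out as 1, 0, 0, 1: a date exactly 7 days after the
-- minimum is still dropped, and only dates 8 or more days later are kept.
```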
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
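The three flags added in Excel could in principle be computed in SQL instead. The sketch below assumes the unflagged matcher output has been loaded into a hypothetical temporary table, demo_matcherraw, filled here with invented rows. The mapping of excludeflag1 to the date-order test and excludeflag2 to the equal-date test is also an assumption based on the order in which the flags are described.&lt;br /&gt;

```sql
-- Hypothetical demo table: matcher output before any flags are added.
CREATE TEMP TABLE demo_matcherraw (
 warning varchar(100),
 file1coname varchar(100),
 file1datefirstinv date,
 file2targetname varchar(100),
 file2announceddate date
);
INSERT INTO demo_matcherraw VALUES
 (NULL, 'Acme Inc', '2005-01-10', 'Acme Inc', '2009-04-01'),
 (NULL, 'Bolt Co', '2008-07-01', 'Bolt Co', '2006-02-15'),
 (NULL, 'Cog LLC', '2010-03-03', 'Cog LLC', '2010-03-03'),
 ('Hall-Warning:Multiple', 'Dyn Corp', '2001-05-05', 'Dyn Corp', '2004-09-09');

SELECT *,
 -- acquisition announced before the first investment
 CASE WHEN file1datefirstinv > file2announceddate THEN 1::int ELSE 0::int END AS excludeflag1,
 -- acquisition announced on the day of the first investment
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,
 -- matcher reported more than one candidate (a NULL warning gives 0)
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag
FROM demo_matcherraw;
```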
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning, where the announceddate of the merger/acquisition is not before the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join row count should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs. 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have IPOs as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join on these duplicate keys. We want each portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
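The count of rows that carry both an ipokey and a maskey can also be checked in SQL. This is a sketch on a hypothetical temporary table, demo_keys, with invented rows and only some of the columns of companybasekeysaddmaskeysaddipokeys; it tests the same two columns as the Excel formula (D is mastargetname, G is ipoissuer).&lt;br /&gt;

```sql
-- Hypothetical demo table standing in for companybasekeysaddmaskeysaddipokeys.
CREATE TEMP TABLE demo_keys (
 coname varchar(100),
 mastargetname varchar(100),
 announceddate date,
 ipoissuer varchar(100),
 ipoissuedate date
);
INSERT INTO demo_keys VALUES
 ('Acme Inc', 'Acme Inc', '2009-04-01', 'Acme Inc', '2007-06-15'),
 ('Bolt Co', 'Bolt Co', '2011-02-02', NULL, NULL),
 ('Cog LLC', NULL, NULL, 'Cog LLC', '2012-09-09'),
 ('Dyn Corp', NULL, NULL, NULL, NULL);

-- Rows where both an M&A match and an IPO match are present.
SELECT COUNT(*) AS bothkeys
FROM demo_keys
WHERE mastargetname IS NOT NULL AND ipoissuer IS NOT NULL;
```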
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
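If an ipo and an ma ever share the same date for one portco, both flags above would be 1 and that portcokey would again match twice. A quick check, sketched here as an addition to the original process, is to count the rows where both flags are set; a result of 0 means there are no such ties:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;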
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the date and amount of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the key tables keep only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
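The three Excel flags can also be computed in SQL. The sketch below assumes the raw Matcher output was loaded into a table holding only the first seven columns above; the table name matcherportcoiporaw is hypothetical, and which exclude flag carries which condition is also an assumption:&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
  CASE WHEN r.file2issuedate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
  CASE WHEN r.file2issuedate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
  CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;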
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
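The check above only covers the portco side of the match. As an extra check that is not part of the original process, the query below lists any ipo key that is matched to more than one portco; whether such rows need cleaning is a judgment call:&lt;br /&gt;
 SELECT file2issuer, file2statecode, file2issuedate, COUNT(*)&lt;br /&gt;
 FROM matcherportcoipocore&lt;br /&gt;
 GROUP BY file2issuer, file2statecode, file2issuedate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;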
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
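To see the full rows behind these three duplicated keys, including their coordinates, a query along these lines may help. It is a sketch that only uses columns already referenced on this page:&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;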
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table to build a matching key. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
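The extra 278 rows come from keys that match more than once on the way through the join. As a diagnostic sketch (not part of the original build), the query below lists the portco keys that appear more than once in the Matcher output; the same pattern can be run on the geo side of the table:&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;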
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. The deaddate is the exit date (the ipo issue date or the ma announced date) if there is one; if there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
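The next query joins against an allyears table with a single year column. If it does not already exist in your database, one way to build it is with generate_series; the 1980 to 2017 range here is an assumption, so adjust it to cover your data:&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;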
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
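As with the other core tables, it is worth confirming that the key is now unique. A minimal check, following the same pattern used elsewhere on this page, is to compare the two counts below; they should be equal:&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;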
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that it can be excluded from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
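To confirm that every stage3 value is covered by one of the flags, a check along these lines (an addition, not part of the original build) lists any values that fall through all of the CASE statements above:&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;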
==Fixing erroneous geo-coordinates==&lt;br /&gt;
Some of the geo coordinates in the db are dirty and point to locations in India and Eastern Europe, even though the company addresses exist. Isolate the dirty geo coordinates and redo the lookup using the Geocode.py script. To isolate them, place a box around the continental US and flag all points that fall outside the box. Add back the points that are located in Hawaii and Puerto Rico. Then import the corrected coordinates back into the db.&lt;br /&gt;
&lt;br /&gt;
I used longitude boundaries of -66 to -125 and latitude boundaries of 24 to 50.  &lt;br /&gt;
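The box test can also be written as a query. This is a sketch, not the method used for this build (the flags were added outside the database and imported below); outsideboxflag is a hypothetical column name, and rows with missing coordinates are flagged as well:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
  CASE WHEN (longitude BETWEEN -125 AND -66) AND (latitude BETWEEN 24 AND 50) THEN 0::int ELSE 1::int END AS outsideboxflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Flagged rows in Hawaii and Puerto Rico still need to be added back by hand before the lookup.&lt;br /&gt;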
 --identify bad geo coords&lt;br /&gt;
 DROP TABLE badgeodata;&lt;br /&gt;
 CREATE TABLE badgeodata (&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  companyname varchar(100),&lt;br /&gt;
  startyear real,&lt;br /&gt;
  endyear real,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  noaddress int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY badgeodata FROM 'badgeodata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --43724&lt;br /&gt;
 DROP TABLE geodirtydata;&lt;br /&gt;
 CREATE TABLE geodirtydata AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 INNER JOIN badgeodata AS bg ON g.coname = bg.companyname;&lt;br /&gt;
 --30498&lt;br /&gt;
 \COPY geodirtydata TO 'geodirtydata.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geodirtydatawithflags;&lt;br /&gt;
 CREATE TABLE geodirtydatawithflags (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real,&lt;br /&gt;
  longdirtyflag int,&lt;br /&gt;
  latdirtyflag int,&lt;br /&gt;
  hawaiiflag int,&lt;br /&gt;
  prflag int,&lt;br /&gt;
  latlongflag int,&lt;br /&gt;
  masterflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtydatawithflags FROM 'geodirtydataflags.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --30498&lt;br /&gt;
 --import coordinates back into db&lt;br /&gt;
 DROP TABLE geodirtyfix;&lt;br /&gt;
 CREATE TABLE geodirtyfix (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geodirtyfix FROM 'geodirtyfix.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2300&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportclean;&lt;br /&gt;
 CREATE TABLE geoimportclean AS&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geoimport AS g&lt;br /&gt;
 WHERE g.coname NOT IN (SELECT coname FROM geodirtyfix);&lt;br /&gt;
 --40378&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE geoimportfix;&lt;br /&gt;
 CREATE TABLE geoimportfix AS&lt;br /&gt;
 SELECT * FROM geoimportclean&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT * FROM geodirtyfix&lt;br /&gt;
 WHERE latitude IS NOT NULL;&lt;br /&gt;
 --41718&lt;br /&gt;
Then redo coleveloutput and colevelsimple, using geoimportfix as your geo table instead of geoimport.&lt;br /&gt;
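The bounding-box flags could also be computed in SQL rather than in a spreadsheet. The query below is a minimal sketch, assuming geoimport has the columns coname, statecode, datefirstinv, latitude and longitude, and assuming that statecode 'HI' and 'PR' are enough to identify Hawaii and Puerto Rico. The flag names mirror geodirtydatawithflags.&lt;br /&gt;
 --sketch only: 1 means the value falls outside the continental US box&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, latitude, longitude,&lt;br /&gt;
 CASE WHEN longitude BETWEEN -125 AND -66 THEN 0::int ELSE 1::int END AS longdirtyflag,&lt;br /&gt;
 CASE WHEN latitude BETWEEN 24 AND 50 THEN 0::int ELSE 1::int END AS latdirtyflag,&lt;br /&gt;
 --assumption: the state code alone identifies Hawaii and Puerto Rico&lt;br /&gt;
 CASE WHEN statecode = 'HI' THEN 1::int ELSE 0::int END AS hawaiiflag,&lt;br /&gt;
 CASE WHEN statecode = 'PR' THEN 1::int ELSE 0::int END AS prflag&lt;br /&gt;
 FROM geoimport;&lt;br /&gt;
Rows with a NULL latitude or longitude fall into the ELSE branch and are flagged as dirty.&lt;br /&gt;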
&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
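To keep the key valid after this point, a unique index can enforce it. This is a sketch, and the index name firmbasecore_key is made up for illustration. Note that a unique index treats NULLs as distinct from one another, whereas SELECT DISTINCT treats them as equal, so the two checks can disagree when key columns contain NULLs.&lt;br /&gt;
 --sketch only: fails with an error if any fully non-NULL key appears more than once&lt;br /&gt;
 CREATE UNIQUE INDEX firmbasecore_key ON firmbasecore (firmname, statecode, foundingdate);&lt;br /&gt;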
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The counts match, so (fundname, firstinvdate) is a valid key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and the fund keys, and include the firmname from the fundbasecore table alongside the fund keys. Export both tables and run the two files through the Matcher. Then manually flag the multiple matches (there are only about 50 of them) and reimport the result into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
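The multiple matches that need manual flagging could be listed with a query like the one below. It is a sketch that assumes a fund matched to more than one firm shows up as repeated (fundname, fundfirstinvdate) rows in matcherfirmsfunds. The column name nummatches is made up for illustration.&lt;br /&gt;
 --sketch only: funds that matched more than one firm&lt;br /&gt;
 SELECT fundname, fundfirstinvdate, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherfirmsfunds&lt;br /&gt;
 GROUP BY fundname, fundfirstinvdate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;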
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19598</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19598"/>
		<updated>2017-07-31T20:06:41Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Creating coleveloutput */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the line count noted in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
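One way to check the row counts after loading is sketched below. It assumes the load scripts create tables named roundbase, ipos, firmbase and fundbase. The column names tablename and numrows are made up for illustration.&lt;br /&gt;
 --sketch only: one row count per loaded table&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) AS numrows FROM roundbase&lt;br /&gt;
 UNION ALL SELECT 'ipos', COUNT(*) FROM ipos&lt;br /&gt;
 UNION ALL SELECT 'firmbase', COUNT(*) FROM firmbase&lt;br /&gt;
 UNION ALL SELECT 'fundbase', COUNT(*) FROM fundbase;&lt;br /&gt;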
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what needs cleaning, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off the roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in textpad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. After that we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
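As an alternative to exporting the table, the full rows behind each duplicate key can be listed in psql. This is a sketch that uses only the companybase1 table and flags defined above. Keys with a NULL component will not join on = and would need IS NOT DISTINCT FROM instead.&lt;br /&gt;
 --sketch only: show every column for keys that appear more than once&lt;br /&gt;
 SELECT c.*&lt;br /&gt;
 FROM companybase1 AS c&lt;br /&gt;
 INNER JOIN (&lt;br /&gt;
  SELECT coname, statecode, datefirstinv&lt;br /&gt;
  FROM companybase1&lt;br /&gt;
  WHERE alwaysusflag = 1 AND undisclosedflag = 0&lt;br /&gt;
  GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ) AS d ON c.coname = d.coname AND c.statecode = d.statecode AND c.datefirstinv = d.datefirstinv&lt;br /&gt;
 WHERE c.alwaysusflag = 1 AND c.undisclosedflag = 0;&lt;br /&gt;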
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which add up to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
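PostgreSQL's DISTINCT ON offers a shorter route to one row per key. This is an alternative sketch, not the method used above: it keeps the row with the largest principalamt for each key, and when two rows tie it keeps one of them arbitrarily instead of dropping the key. The table name ipocorealt is made up for illustration.&lt;br /&gt;
 --sketch only: one row per (issuer, issuedate, statecode), largest principalamt first&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;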
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
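 --read from mas so that these ids are the ones joined on mas.id below (mas1 ids are generated separately)&lt;br /&gt;
 FROM mas&lt;br /&gt;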
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
If the distinct key count matches the total row count of the mascore table (114825), the mascore table is clean.&lt;br /&gt;
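A window function can pick one row per key without a separate include table. The query below is a sketch of that alternative. It assumes mas already has the id column added above, and the names mascorealt and rownum are made up for illustration. The result carries the extra rownum column.&lt;br /&gt;
 --sketch only: keep the lowest id for each (targetname, targetstatecode, announceddate)&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (&lt;br /&gt;
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY targetname, targetstatecode, announceddate ORDER BY id) AS rownum&lt;br /&gt;
  FROM mas AS m&lt;br /&gt;
 ) AS ranked&lt;br /&gt;
 WHERE rownum = 1;&lt;br /&gt;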
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
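To see how the flag behaves at the boundaries, the same CASE expression can be run against a few made-up dates. The dates below are hypothetical and exist only for illustration.&lt;br /&gt;
 --sketch only: dateflag logic applied to made-up dates with a minimum of 2010-01-01&lt;br /&gt;
 SELECT d.announceddate, d.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; d.minanndate OR d.announceddate = d.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-01', DATE '2010-01-01'), --the minimum itself: dateflag 1&lt;br /&gt;
  (DATE '2010-01-05', DATE '2010-01-01'), --4 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-08', DATE '2010-01-01'), --exactly 7 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-09', DATE '2010-01-01')  --8 days later: dateflag 1&lt;br /&gt;
 ) AS d(announceddate, minanndate);&lt;br /&gt;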
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where announceddate &amp;lt; datefirstinv and another exclude flag for rows where datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is strictly later than the datefirstinv.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
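The three flags that this section adds in Excel could alternatively be computed in SQL. The query below is a sketch only. It assumes the raw Matcher output was first loaded into a hypothetical table named matcherraw holding the first seven columns of matcherportcomas (warning through file2announceddate).&lt;br /&gt;
 --sketch only: matcherraw is a hypothetical table of unflagged Matcher output&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherraw;&lt;br /&gt;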
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open the file in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take whichever of the ipokey or maskey has the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas and to ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined row counts of your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
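The same check can be done in one query rather than by hand. This is a minimal sketch that assumes matcherportcomascore is the source of the 8655 figure (matcherportcoipocore has 2350 rows) and that no portcokey has an ipoissuedate equal to its announceddate; on such a tie both validity flags are set and the first column overcounts.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
     WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusdoublematched,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchedtotal;&lt;br /&gt;
 --the two columns should be equal&lt;br /&gt;
If the two columns differ, look for portcokeys that appear more than once in either matcher core table, or for exit dates that tie.&lt;br /&gt;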
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the date and amount of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, you want to exclude the row from your table. If both flags are 0 (i.e. the issuedate &amp;gt; datefirstinv), you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
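If you would rather not compute the flags in Excel, they could be derived in SQL after importing the raw Matcher output. This is only a sketch: matcherportcoiporaw and matcherportcoipoflagged are hypothetical table names, the raw table is assumed to hold just the first seven columns above, and the assignment of excludeflag1 to the earlier-date case and excludeflag2 to the same-date case is an assumption, since that mapping is not stated here.&lt;br /&gt;
 --hypothetical tables; flag order is an assumption&lt;br /&gt;
 CREATE TABLE matcherportcoipoflagged AS&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2issuedate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2issuedate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;
Rows with a NULL date or a NULL warning fall through to 0 in this sketch, so check those separately if they occur.&lt;br /&gt;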
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 rows against 43651 distinct keys), spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
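To see the full rows behind these three remaining duplicate keys, including their coordinates, you can join the duplicated keys back to geo1. A minimal sketch using only the table and key columns already used in this section:&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS dup&lt;br /&gt;
 ON g.city = dup.city AND g.coname = dup.coname AND g.startyear = dup.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;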
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
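The 278 extra rows (45018 - 44740) appear when a portco key has more than one row in matcherportcogeo, since the geocore key was already verified to be unique. A sketch of a check that lists those keys, using only the columns defined above:&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;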
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key. Some of the geo coordinates are incorrect. This was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
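First find the deaddate for each company. Make sure that you have the companybasecore, ipocore, and mascore tables, along with the companybasekeyipokeycore and companybasekeymaskeycore tables that link them. Then calculate a deaddate. If the company had an ipo or ma, the deaddate is that exit date. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;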
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
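The tempbase query above reads from an allyears table whose creation is not shown in this section. A minimal sketch of one way to build it, assuming it holds a single integer column named year covering every year from the earliest aliveyear to the latest deadyear in deadalive1 (run it after deadalive1 is built and before tempbase):&lt;br /&gt;
 --assumed definition; the original allyears table may have been built differently&lt;br /&gt;
 DROP TABLE IF EXISTS allyears;&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(bounds.firstyear, bounds.lastyear) AS year&lt;br /&gt;
 FROM (SELECT MIN(aliveyear)::int AS firstyear, MAX(deadyear)::int AS lastyear FROM deadalive1) AS bounds;&lt;br /&gt;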
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating colevelsimple==&lt;br /&gt;
 DROP TABLE colevelsimple;&lt;br /&gt;
 CREATE TABLE colevelsimple AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, city, addr1, addr2, zip, aliveyear, deadyear, latitude, longitude&lt;br /&gt;
 FROM coleveloutput WHERE aliveyear IS NOT NULL and deadyear IS NOT NULL AND latitude IS NOT NULL;&lt;br /&gt;
 --31171&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 --roundcore has 143000 rows, so the joins duplicated one key; find it and remove it&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks rounds that financed activities we are not interested in, so that they can be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
The two counts match, so (fundname, firstinvdate) is a valid key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and the fund keys, and include the firmname recorded in the fundbasecore table alongside each fund key. Run these two files through the Matcher. Then manually flag the multiple matches; there are only ~50 of them. Then reimport the result into vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19583</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19583"/>
		<updated>2017-07-31T15:44:46Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Gathering geo data from company addresses */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, statecode, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the number of lines given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains data that we would like to remove, such as Undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same fixes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build off the roundbase table will inherit the dirty keys and spread them to other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, meaning there is a 1:1 relationship between keys and entries. Given an entry you can form a unique key, and given a (coname, statecode, datefirstinv) key you can find exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry.&lt;br /&gt;
The companybase table also contains undisclosed companies and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something that differs between the two New York Digital Health LLC entries, so that we can manually delete one of them. It turns out the url is different, so we'll use that.&lt;br /&gt;
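As an alternative to exporting the table, the two rows can also be compared directly in psql. This is a minimal sketch, assuming the duplicate key shown above; \x on switches psql to expanded display so that the wide rows are easier to compare:&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;
 \x off&lt;br /&gt;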
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate them; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries (1888 rows in the ipos table), look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
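For the comparison you also need the total row count of ipocore, which can be obtained with a query like this:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;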
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
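 --use mas1 here: masinclude took its ids from mas1, and the separately generated ids in mas are not guaranteed to line up&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;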
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the Matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the Matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
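To see how the flag behaves, here is a small self-contained sketch. The dates are made up for illustration, and 2010-01-01 plays the role of the minimum announceddate:&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d);&lt;br /&gt;
 --dateflag is 1 for 2010-01-01 (the minimum) and 2010-01-09 (more than 7 days later)&lt;br /&gt;
 --dateflag is 0 for 2010-01-05 and 2010-01-08 (7 days or less after the minimum)&lt;br /&gt;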
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
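Because the flags were added by hand in Excel, it may be worth a quick consistency check after the import. The sketch below assumes that excludeflag1 marks announceddate &amp;lt; datefirstinv and that excludeflag2 marks announceddate = datefirstinv (the table definition does not record which flag is which); under that assumption it should return 0:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE (file2announceddate &amp;lt; file1datefirstinv AND excludeflag1 = 0)&lt;br /&gt;
 OR (file2announceddate = file1datefirstinv AND excludeflag2 = 0);&lt;br /&gt;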
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not carry a warning and where the announceddate of the merger/acquisition is strictly later than the datefirstinv.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to recover as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same number of rows as the matcherwarningmindates table, but it doesn't (366 vs. 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because some portco keys match both a maskey and an ipokey. Some companies have IPOs as well as mergers/acquisitions, and some data may be miscoded by SDC Platinum. A company that has both an ipo and an ma would cause our row counts to increase every time we join on these duplicate keys. We want each portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
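The masterdate CASE expression above can be cross-checked against PostgreSQL's LEAST function, which ignores NULL arguments and returns NULL only when every argument is NULL. A minimal sketch of that check, assuming the companybasekeysaddmaskeyaddipokeysmindate table built above; it should return 0 if the CASE and LEAST agree on every row:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate&lt;br /&gt;
 WHERE masterdate IS DISTINCT FROM LEAST(announceddate, ipoissuedate);&lt;br /&gt;
&lt;br /&gt;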
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined row count of your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
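The arithmetic above only balances if no portco has an ipo and an ma on the same date, because such a row would get both maskeyvalid = 1 and ipokeyvalid = 1 and would appear in both key tables. A quick check, sketched here against the flag table built above (the counts reported in this section appear to imply there are no such ties, but that is worth confirming on a fresh rebuild):&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, announceddate, ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;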
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the corresponding dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], only the exit deal with the minimum date is kept for each company, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one for rows where the issuedate &amp;lt; datefirstinv and one for rows where the issuedate = datefirstinv. If either exclude flag is 1, you will want to exclude the row from your table; only rows where the issuedate &amp;gt; datefirstinv are kept. If both flags are 0, you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
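As an alternative to building the flags in Excel, the same three flags could be computed in SQL. This is only a sketch: it assumes the unflagged Matcher output has first been loaded into a hypothetical table named matcherportcoiporaw that has just the warning column and the six key columns, it writes to a hypothetical table named matcherportcoipoflagged, and it assumes excludeflag1 marks issuedate &amp;lt; datefirstinv and excludeflag2 marks issuedate = datefirstinv (the text above does not say which flag is which):&lt;br /&gt;
 CREATE TABLE matcherportcoipoflagged AS&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2issuedate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2issuedate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;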
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys, spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
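The extra 278 rows (45,018 - 44,740) mean that some companybasecore rows are picking up more than one match. A sketch of two diagnostic queries, assuming the tables above: the first lists portco keys that appear on more than one row of the Matcher output, and the second lists (coname, city, year) combinations that occur more than once in companybasecore, which would have put repeated keys into portcokeysforgeo since that table was exported without DISTINCT:&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 &lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear, COUNT(*)&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 GROUP BY coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;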
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data. I traced this back to a dirty file we initially used for geo coordinates called GeocodedVCData. In the future the safest way to get geo coordinates is to feed the company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
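The tempbase query below joins against an allyears table that is not created in this section. If it does not already exist, a minimal sketch is one row per calendar year. The column name year is taken from the query below; deriving the range from deadalive1 and storing the year as an integer are assumptions:&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(y.firstyear, y.lastyear) AS year&lt;br /&gt;
 FROM (SELECT MIN(aliveyear)::int AS firstyear, MAX(deadyear)::int AS lastyear FROM deadalive1) AS y;&lt;br /&gt;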
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
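Because growthflag is 1 exactly when one of seedflag, earlyflag or laterflag is 1, the per-group counts should add up. A sketch of a consistency check on roundleveloutput, which should return 0 if the flags were built as described in the stage flags section:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 WHERE numsel &amp;lt;&amp;gt; numseeds + numearly + numlater;&lt;br /&gt;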
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks rounds of financing for activities we are not interested in, which should therefore be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
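It can be worth checking which stage3 values, if any, are not picked up by any of the flags. A sketch of that check against the stageflags table just built; rows with a NULL stage3 will also show up here:&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;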
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firms and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
&lt;br /&gt;
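The multiple matches mentioned above could be listed with a query like the following before flagging them by hand. This is a sketch only; it assumes a multiple match means one fund key paired with more than one row in matcherfirmsfunds, and nummatches is an illustrative alias.&lt;br /&gt;
 --fund keys that matched more than one firm&lt;br /&gt;
 SELECT fundname, fundfirstinvdate, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherfirmsfunds&lt;br /&gt;
 GROUP BY fundname, fundfirstinvdate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;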
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19582</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19582"/>
		<updated>2017-07-31T15:28:12Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Gathering geo data from company addresses */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the number of lines given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
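One way to compare the loaded row counts in a single pass is sketched below. The table names are assumptions based on tables referenced in these notes; adjust them to whatever the Load scripts actually create.&lt;br /&gt;
 --row counts for the loaded tables (table names assumed)&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) AS numrows FROM roundbase&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT 'ipos', COUNT(*) FROM ipos&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT 'fundbase', COUNT(*) FROM fundbase;&lt;br /&gt;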
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what needs cleaning, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build off the roundbase table will inherit its dirty keys. Once the roundbase table is clean we rename it roundbasecore so that we know it is safe to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
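On a first run the DROP TABLE statements report an error because the tables do not exist yet. A minimal variation, assuming PostgreSQL, is to use IF EXISTS so the script can be rerun from scratch:&lt;br /&gt;
 --runs cleanly whether or not the table already exists&lt;br /&gt;
 DROP TABLE IF EXISTS round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;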
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So 1 key can match to more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Delete this record from the roundbase table using the command below. Once that is done we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
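Instead of copying the table out, the full rows behind a duplicate key can be pulled up inside psql. The sketch below assumes a PostgreSQL version with window functions (8.4 or later); keycount is an illustrative column name.&lt;br /&gt;
 --show every column for rows whose key appears more than once&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT *, COUNT(*) OVER (PARTITION BY coname, statecode, datefirstinv) AS keycount&lt;br /&gt;
  FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0) AS dup&lt;br /&gt;
 WHERE keycount &amp;gt; 1&lt;br /&gt;
 ORDER BY coname, statecode, datefirstinv;&lt;br /&gt;
Comparing the returned rows column by column shows which field (here the url) distinguishes the entries.&lt;br /&gt;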
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
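To keep the key valid after the table is built, one option is a unique index. This is a sketch, not part of the original build, and the index name is hypothetical. Note that by default PostgreSQL treats NULLs as distinct in a unique index, so rows with a NULL statecode are not checked against each other.&lt;br /&gt;
 --raises an error if any (coname, statecode, datefirstinv) key is duplicated&lt;br /&gt;
 CREATE UNIQUE INDEX companybasecore_key_idx&lt;br /&gt;
 ON companybasecore (coname, statecode, datefirstinv);&lt;br /&gt;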
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicates. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in TextPad or Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which account for 1,888 rows in the ipos table, look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
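The hand-written DELETE statements above could also be generated from the ipoduplicates2 table. The following is a sketch under the assumption that ipoduplicates2 still holds the remaining duplicate keys. Unlike the manual statements it matches on the full key including issuedate, so it may remove fewer rows.&lt;br /&gt;
 --remove every key listed in ipoduplicates2 from ipoinclude&lt;br /&gt;
 --IS NOT DISTINCT FROM also matches rows where statecode is NULL on both sides&lt;br /&gt;
 DELETE FROM ipoinclude AS inc&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE inc.issuer = d.issuer&lt;br /&gt;
  AND inc.issuedate = d.issuedate&lt;br /&gt;
  AND inc.statecode IS NOT DISTINCT FROM d.statecode;&lt;br /&gt;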
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table (114,825), so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
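A sketch of an alternative that keeps one row per key in a single statement uses PostgreSQL's DISTINCT ON. It assumes the id column added to mas above; mascorealt is a hypothetical table name, chosen so the original mascore is left untouched.&lt;br /&gt;
 --keep the row with the lowest id for each key&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;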
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week after the earliest announceddate for the same targetname and targetstatecode. Keep the key with the earliest announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but by no more than 1 week, the dateflag is 0.&lt;br /&gt;
&lt;br /&gt;
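To see how the dateflag behaves, here is a small worked example on made-up dates, assuming a minimum announceddate of 2010-01-01 for the targetname, targetstatecode pair:&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d);&lt;br /&gt;
 --2010-01-01 gets 1 (it is the minimum), 2010-01-05 and 2010-01-08 get 0, 2010-01-09 gets 1&lt;br /&gt;
&lt;br /&gt;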
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the merger/acquisition announceddate is later than the datefirstinv (not earlier than or equal to it).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the matches with warnings as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
The counts match, so use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0, creating the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
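The check above only confirms that each portco key appears once. A sketch of the matching check from the mas side, which lists any mas key attached to more than one portco, might look like this (numportcos is an illustrative alias):&lt;br /&gt;
 SELECT file2targetname, file2targetstatecode, file2announceddate, COUNT(*) AS numportcos&lt;br /&gt;
 FROM matcherportcomascore&lt;br /&gt;
 GROUP BY file2targetname, file2targetstatecode, file2announceddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
An empty result means the match is 1:1 in both directions.&lt;br /&gt;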
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take whichever of the ipokey or maskey has the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
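In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression above could be reduced to a single LEAST(announceddate, ipoissuedate) call. A quick sketch on made-up dates:&lt;br /&gt;
 SELECT LEAST(DATE '2012-03-01', DATE '2011-06-15') AS bothdates,&lt;br /&gt;
  LEAST(DATE '2012-03-01', NULL::date) AS onlyma,&lt;br /&gt;
  LEAST(NULL::date, NULL::date) AS neither;&lt;br /&gt;
 --returns 2011-06-15, 2012-03-01 and NULL&lt;br /&gt;
&lt;br /&gt;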
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined row counts of your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
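One edge case worth checking: both flags are set when announceddate and ipoissuedate fall on the same day, so a portco could in principle land in both core tables. A sketch of a check for that case, followed by the flag totals in one query (the aliases are illustrative):&lt;br /&gt;
 --counts portcos flagged for both an ipo and an ma (ties on the exit date)&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;
 --totals of valid ma keys and valid ipo keys across all portcos&lt;br /&gt;
 SELECT SUM(maskeyvalid) AS numma, SUM(ipokeyvalid) AS numipo, COUNT(*) AS numportcos&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;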
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage, make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name, plus the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will multiply rows through the joins and make this master table a complete mess, so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
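A further check, sketched here on the assumption that (coname, statecode, datefirstinv) is still the key of the master table, is to look for keys that occur more than once. If the joins are clean it should return no rows.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &gt; 1;&lt;br /&gt;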
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
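Since the flags are entered by hand in Excel, it can be worth cross-checking them against the dates once the file is imported. The query below is a sketch: it counts rows where both exclude flags are 0 even though the issuedate is not after the datefirstinv, and it should return 0 if the flags were filled in as described above.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND file2issuedate &lt;= file1datefirstinv;&lt;br /&gt;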
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows vs. 43651 keys), spread across the 8 duplicated keys below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
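The full rows behind each of these duplicated keys can be listed side by side with a query along the following lines. This is a sketch that uses only the geo1 columns appearing above.&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;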
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv column of the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the Matcher you will notice there are a ton of matching errors as usual. If you simply import the output into vcdb2 and join companybasecore with geocore through it, the duplicate matches multiply rows and your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
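To see where the extra rows come from, you can list the portco keys that appear more than once in the Matcher output. This is a sketch against the matcherportcogeo table defined above; each key it returns can produce extra rows in the join.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;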
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are at [[Geocode.py]]. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key. Some of the geo coordinates are incorrect; this was found while analyzing the output data, and I traced it back to a dirty file we initially used for geo coordinates. In the future the safest way to get geo coordinates is to feed company addresses to the Geocode.py script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
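First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: if the company had an ipo the deaddate is the issuedate, otherwise if it had an ma the deaddate is the announceddate, and if there was no exit the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;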
 CREATE TABLE deaddate AS&lt;br /&gt;
 --city is carried along because deadalive1 selects it below&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
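As a quick sanity check on the CASE logic, the companies can be broken down by which branch set their deaddate. This is a sketch; the exittype labels are made up for illustration.&lt;br /&gt;
 SELECT CASE&lt;br /&gt;
   WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
   WHEN announceddate IS NOT NULL THEN 'ma'&lt;br /&gt;
   ELSE 'none'&lt;br /&gt;
  END AS exittype, COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY 1;&lt;br /&gt;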
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
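The tempbase query above reads from an allyears table that is not built in this section. If it does not already exist in your database, a one-column table of years can be generated as sketched below. The 1980 to 2017 range is an assumption, not taken from the original build, so adjust it to the span of your data.&lt;br /&gt;
 --hypothetical year range, change as needed&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;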
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
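Following the same pattern used for the other core tables, you can confirm that (coname, rounddate) is now a valid key for roundcore. This is a sketch; the two counts should be equal.&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;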
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
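To check that every stage3 value is picked up by one of the growth, transaction or exclude flags, you can list the values where that is not the case. This is a sketch; any stage3 value it returns (including NULL) is not covered by the CASE statements above.&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag + transactionflag + excludeflag != 1&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;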
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firm keys and the fund keys, and include the firmname from the fundbasecore table with the fund keys. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19577</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19577"/>
		<updated>2017-07-28T20:28:30Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, statecode, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the row count copied into each new table matches the line count in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build off roundbase will inherit the dirty keys. Once the roundbase table is clean we rename it roundbasecore so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
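The two counts can also be fetched side by side in one statement. A minimal sketch, assuming the companybase table built above (the column aliases totalrows and distinctkeys are made up for illustration):&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybase) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a) AS distinctkeys;&lt;br /&gt;
 --the key is unique only when totalrows = distinctkeys&lt;br /&gt;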
Some of the rows in the companybase table have undisclosed company names or belong to companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look at it in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
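As an alternative to exporting the table, the two conflicting rows could be inspected directly in psql. This is only a sketch, assuming the duplicate key found above; run it before the DELETE and compare the columns (\x switches psql's expanded display on and off, which helps with wide rows):&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;
 \x off&lt;br /&gt;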
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicates; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries (1888 rows in the ipos table), look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamt values. For each key, keep the row with the MAX principalamt and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
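The same cleanup could be driven from the ipoduplicates2 table instead of hand-written statements. This is only a sketch, not the method used above: it matches on the full key including issuedate, so it may remove fewer rows than the statements above, which match on issuer and statecode only. Keys with a NULL statecode would not be matched by the = comparison.&lt;br /&gt;
 DELETE FROM ipoinclude AS i&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE i.issuer = d.issuer AND i.issuedate = d.issuedate AND i.statecode = d.statecode;&lt;br /&gt;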
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the MIN id for each duplicate key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
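 FROM mas --use mas rather than mas1: the join below is on mas.id, and the two SERIAL columns are generated independently&lt;br /&gt;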
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches its total row count, so the mascore table is clean.&lt;br /&gt;
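PostgreSQL's DISTINCT ON can do the same de-duplication in one statement. A sketch, assuming mas already has the id column added above; mascoresketch is a hypothetical table name, used so that mascore is not overwritten:&lt;br /&gt;
 CREATE TABLE mascoresketch AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;
 --keeps the row with the lowest id for each key, like MIN(id) in masinclude&lt;br /&gt;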
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many warnings, so we filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
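A small worked example of the flag, using made-up dates rather than rows from mas (the minimum date is assumed to be 2010-01-01):&lt;br /&gt;
 SELECT d.announceddate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; m.minanndate OR d.announceddate = m.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS d(announceddate)&lt;br /&gt;
 CROSS JOIN (SELECT DATE '2010-01-01' AS minanndate) AS m;&lt;br /&gt;
 --2010-01-01 gets 1 (it is the minimum), 2010-01-05 and 2010-01-08 get 0, 2010-01-09 gets 1&lt;br /&gt;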
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where announceddate &amp;lt; datefirstinv and a second exclude flag for rows where datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the result back into your db by creating the matcherportcomas table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
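The three Excel flags could also be computed in SQL. A sketch, assuming the unflagged matcher output has been loaded into a hypothetical table matcherportcomasraw that holds only the first seven columns of matcherportcomas; the mapping of the two date conditions to excludeflag1 and excludeflag2 is also an assumption:&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;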
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is later than the datefirstinv (not earlier and not equal).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries recover as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return as many rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0, creating the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. Each should be 1:1 on itself, meaning that 1 key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the sum to the total number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
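The comparison can be done in one query. A sketch, assuming the right-hand side of the sum refers to the matcherportcomascore and matcherportcoipocore tables (keysfromflags and keysfrommatcher are made-up aliases):&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keysfromflags,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS keysfrommatcher;&lt;br /&gt;
 --the two columns should be equal, provided no portco has an ipo and an ma on the same date&lt;br /&gt;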
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
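If any count looks off, a grouped count on the company key is a quick way to find rows that the joins duplicated. This is a sketch that assumes (coname, statecode, datefirstinv) is still the intended key of the master table; it should return no rows when every key is unique.&lt;br /&gt;
 --list any company key that appears more than once in the master table&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;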
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
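As a sketch, the two counts for a table can also be put side by side in a single query, so a mismatch is visible at a glance. The column aliases totalrows and distinctkeys are made up for illustration.&lt;br /&gt;
 --both numbers should be equal if the key is unique&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM ipocore) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a) AS distinctkeys;&lt;br /&gt;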
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you will want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
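Since the flags are set by hand in Excel, it may be worth a sanity check after the import. The sketch below counts rows that would be kept (both exclude flags at 0) even though the issue date is not after the first investment date. It assumes the flags were meant to catch exactly those rows, in which case the count should be 0.&lt;br /&gt;
 --kept rows whose dates say they should have been excluded&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND file2issuedate &amp;lt;= file1datefirstinv;&lt;br /&gt;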
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
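An alternative sketch, not the method used to build the tables here, is PostgreSQL's DISTINCT ON, which picks the earliest issuedate per portco key in one step. Unlike the join on the minimum date, it returns exactly one row per key even if two matches share the same minimum issuedate; which of the tied rows is returned is then arbitrary.&lt;br /&gt;
 --one row per portco key, earliest issuedate first&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;&lt;br /&gt;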
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and ready to use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
There are still 11 more rows than distinct keys (43,662 rows versus 43,651 keys), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
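To see the full rows behind each remaining duplicate key before choosing which one to delete, the duplicate keys can be joined back to geo1. This is a sketch that only uses the key columns named above; it is meant to be run before the DELETE statements.&lt;br /&gt;
 --show every column of the rows that share a duplicated key&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
  GROUP BY city, coname, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;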
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table to build the matching portco key. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
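To see which companies were multiplied by the Matcher output, group the joined table by the companybase key. A sketch, assuming (coname, statecode, datefirstinv) is unique in companybasecore as verified earlier:&lt;br /&gt;
 --companybase keys that now appear more than once&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybasecorejoingeo&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;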
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to do this. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. If the company had an IPO, the deaddate is the issue date; otherwise, if it had an MA, it is the announced date. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
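A quick breakdown of how each deaddate was derived can help confirm the CASE logic. In this sketch the labels 'ipo', 'ma' and 'noexit' and the alias exittype are made up for illustration; the branches follow the same precedence as the CASE above.&lt;br /&gt;
 --count companies by the rule that set their deaddate&lt;br /&gt;
 SELECT&lt;br /&gt;
 CASE&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN 'ma'&lt;br /&gt;
 ELSE 'noexit'&lt;br /&gt;
 END AS exittype,&lt;br /&gt;
 COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY exittype;&lt;br /&gt;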
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
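The tempbase query above joins against an allyears table that is not defined in the steps shown here. A minimal sketch of one way to create it is below, assuming it only needs a single integer column named year. The range 1980 to 2017 is a placeholder and should be adjusted to cover the datefirstinv and deadyear values in your data. The table would need to exist before tempbase is built.&lt;br /&gt;
 --hypothetical: one row per calendar year&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;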
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
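The count is one higher than the 143,000 rows in roundcore, so one of the joins duplicated a row. Check for (coname, rounddate) keys that occur more than once and delete them:&lt;br /&gt;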
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
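To confirm that (coname, rounddate) is now a valid key, the distinct key count in roundcore should equal its row count. A sketch:&lt;br /&gt;
 --should match the row count of roundcore&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;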
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that they can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout', 'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
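Any stage3 value that is not listed in one of the CASE expressions ends up with every flag at 0. A sketch of a check that lists such values, so they can be reviewed and added to the right flag if needed:&lt;br /&gt;
 --stage3 values that no flag picks up&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;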
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
==Name based matching firms to funds==&lt;br /&gt;
Get the firms and fund keys and also include the firmname from the fundbasecore table. Run these two files through the Matcher. Then manually flag the multiple matches. There are only ~50 of them. Then reimport to vcdb2. &lt;br /&gt;
 DROP TABLE fundkeysandfirms;&lt;br /&gt;
 CREATE TABLE fundkeysandfirms AS&lt;br /&gt;
 SELECT fundname, firstinvdate, firmname&lt;br /&gt;
 FROM fundbasecore;&lt;br /&gt;
 --27097&lt;br /&gt;
 \COPY fundkeysandfirms TO 'fundkeysandfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE firmkeys;&lt;br /&gt;
 CREATE TABLE firmkeys AS&lt;br /&gt;
 SELECT firmname, statecode, foundingdate&lt;br /&gt;
 FROM firmbasecore;&lt;br /&gt;
 \COPY firmkeys TO 'firmkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE matcherfirmsfunds (&lt;br /&gt;
  firmname varchar(100),&lt;br /&gt;
  firmstatecode varchar(2),&lt;br /&gt;
  firmfoundingdate date,&lt;br /&gt;
  fundname varchar(100),&lt;br /&gt;
  fundfirstinvdate date,&lt;br /&gt;
  fundfirmname varchar(100),&lt;br /&gt;
  excludeflag int,&lt;br /&gt;
  excludeflagmaster int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherfirmsfunds FROM 'matcheroutputfundsfirms.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2364&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19575</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19575"/>
		<updated>2017-07-28T16:50:55Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, statecode, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the line count given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
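Both counts can also be returned side by side in one statement. This is a small convenience sketch that uses only the table and key columns above; the output column names totalrows and distinctkeys are made up for illustration.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybase) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a) AS distinctkeys;&lt;br /&gt;
 --the key is unique only when the two numbers are equal&lt;br /&gt;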
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
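The two rows behind the duplicate key can also be compared directly in psql instead of exporting the whole table. A minimal sketch, assuming companybase1 still holds both rows:&lt;br /&gt;
 --\x switches psql to expanded display so the wide rows are easier to compare&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;
 \x off&lt;br /&gt;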
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and Undisclosed Company that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between the key (coname, statecode, datefirstinv) and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate them, and this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad or Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for another way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
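Rather than typing each statement by hand, DELETE statements of this shape could be generated from the ipoduplicates2 table with PostgreSQL's format() function, where %L quotes each value as a SQL literal. This is only a sketch of one way to do it, not necessarily how the statements above were produced; review the generated text before running it.&lt;br /&gt;
 --a NULL statecode would produce "statecode = NULL", which matches nothing, so check for NULLs first&lt;br /&gt;
 SELECT format('DELETE FROM ipoinclude WHERE issuer = %L AND statecode = %L;', issuer, statecode)&lt;br /&gt;
 FROM ipoduplicates2&lt;br /&gt;
 ORDER BY issuer;&lt;br /&gt;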
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column to mas and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
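 --take ids from mas itself: mas1 gets its own SERIAL sequence, so its ids are not guaranteed to match mas.id&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;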
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
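PostgreSQL's DISTINCT ON can express the same keep-one-row-per-key idea in a single statement. This is an alternative sketch, not the method used above; mascorealt is a made-up table name, and the query assumes the id column has already been added to mas.&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas&lt;br /&gt;
 --within each key the row with the lowest id sorts first and is the one kept&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;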
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys further. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is after the minimum but no more than 1 week later, the dateflag is 0.    &lt;br /&gt;
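To see how the flag behaves, the same CASE expression can be run against a few literal dates. The dates below are made up for illustration, with 2010-01-01 standing in for the minimum announceddate.&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d);&lt;br /&gt;
 --2010-01-01 gives 1 (it is the minimum)&lt;br /&gt;
 --2010-01-05 gives 0 (within the week)&lt;br /&gt;
 --2010-01-08 gives 0 (exactly 7 days later is not more than a week)&lt;br /&gt;
 --2010-01-09 gives 1 (more than a week later)&lt;br /&gt;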
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where the announceddate &amp;lt; datefirstinv and a second exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
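Because the three flags are built by hand in Excel, it may be worth confirming after the import that they agree with their definitions. A minimal sketch, assuming the flags follow the rules described above and that the warning text is exactly 'Hall-Warning:Multiple':&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 != (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 != (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag != (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;
 --a count of 0 means every non-NULL flag matches its definition (rows with a NULL flag are not counted)&lt;br /&gt;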
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is later than the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables, so 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
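In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the masterdate CASE above should give the same result as a single LEAST(announceddate, ipoissuedate) call. A quick check on literal values (the dates and column names are made up for illustration):&lt;br /&gt;
 SELECT LEAST(DATE '2010-01-01', DATE '2009-06-30') AS bothdates,&lt;br /&gt;
 LEAST(DATE '2010-01-01', NULL) AS onedate,&lt;br /&gt;
 LEAST(NULL::date, NULL::date) AS nodates;&lt;br /&gt;
 --bothdates is 2009-06-30, onedate is 2010-01-01, nodates is NULL&lt;br /&gt;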
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
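The same arithmetic can be done in one query. This sketch assumes the right-hand side of the comparison is the row counts of matcherportcomascore and matcherportcoipocore; the column names leftside and rightside are made up. The two sides can differ if a portco has an ipo and an ma on the same date, since both validity flags are then 1.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS leftside,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS rightside;&lt;br /&gt;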
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in the sections [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, then you will want to exclude the entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
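The same flags could also be derived in SQL rather than in Excel. The following is a sketch under stated assumptions, not the method used to build vcdb2: it assumes the unflagged Matcher output (the first seven columns above) has been loaded into a hypothetical table matcherportcoiporaw, and it assumes excludeflag1 marks issuedate &amp;lt; datefirstinv while excludeflag2 marks issuedate = datefirstinv, an ordering the page does not spell out.&lt;br /&gt;
 --sketch: matcherportcoiporaw and matcherportcoipoflagged are hypothetical table names&lt;br /&gt;
 CREATE TABLE matcherportcoipoflagged AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 --assumed meaning of excludeflag1: IPO dated before the first investment&lt;br /&gt;
 CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 --assumed meaning of excludeflag2: IPO dated on the day of the first investment&lt;br /&gt;
 CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw;&lt;br /&gt;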
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
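To inspect the full rows behind each duplicated key, a query along the following lines can help. It is a sketch: the alias d is illustrative, and keys containing NULLs would be skipped by the equality join.&lt;br /&gt;
 --sketch: show every column of the rows that share a duplicated key&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo&lt;br /&gt;
  GROUP BY city, coname, startyear HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;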
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows versus 43651 keys), spread across the 8 duplicated keys shown below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the Matcher you will notice there are a ton of matching errors, as usual. If you simply import the output into vcdb2 and try joining companybasecore with geocore through it, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018 in the join below.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
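One way to see where the extra rows come from is to list the portco keys that appear more than once in matcherportcogeo, since each repeat multiplies the matching companybasecore row in the join above. A sketch; the alias nummatches is made up for illustration.&lt;br /&gt;
 --sketch: portco keys that the Matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;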
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
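Because this is a LEFT JOIN, companies without a row in geoimport keep NULL coordinates. A quick sketch for counting how many still need geo data (the alias nummissing is illustrative):&lt;br /&gt;
 --sketch: companies in the master table that have no coordinates yet&lt;br /&gt;
 SELECT COUNT(*) AS nummissing&lt;br /&gt;
 FROM companybasegeomaster&lt;br /&gt;
 WHERE latitude IS NULL OR longitude IS NULL;&lt;br /&gt;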
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. The deaddate is the exit date: the issuedate from ipocore if there is one, otherwise the announceddate from mascore. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 --city is carried along because deadalive1 below selects it&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
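The CASE above picks the first available date in the order issuedate, announceddate, datelastinv + 5 years, which is what COALESCE does. As a sanity check, the sketch below compares the stored deaddate with the COALESCE form; it assumes issuedate and announceddate are date columns, and it should return 0 if the two forms agree.&lt;br /&gt;
 --sketch: count rows where the CASE result differs from the COALESCE form&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 WHERE deaddate IS DISTINCT FROM COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR');&lt;br /&gt;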
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
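The tempbase query above relies on an allyears table with one row per calendar year, which is not defined in this section. If it does not exist yet, something like the following sketch would build it. The 1980 to 2017 range is an assumption for illustration only and should be set to the span of your data.&lt;br /&gt;
 --sketch: one row per year; the 1980-2017 range is an assumed placeholder&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;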
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that these entries can be excluded from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
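It can be worth checking whether any stage3 values fall through all of the CASE branches above. A sketch; the alias numrounds is illustrative. Any rows returned are stages that are neither growth, transaction nor exclude.&lt;br /&gt;
 --sketch: stage3 values not covered by growthflag, transactionflag or excludeflag&lt;br /&gt;
 SELECT stage3, COUNT(*) AS numrounds&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY numrounds DESC;&lt;br /&gt;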
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag for them and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key.&lt;br /&gt;
 CREATE TABLE fundbasecore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19574</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19574"/>
		<updated>2017-07-28T16:49:29Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, statecode, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count noted in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
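One way to recheck the counts after loading is to query them all at once. This is a sketch that assumes the load scripts create tables with the names used elsewhere on this page (roundbase, ipos, geo, firmbase, fundbase); adjust the list to whatever your scripts actually create.&lt;br /&gt;
 --sketch: table names are assumed from their use elsewhere on this page&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) AS numrows FROM roundbase&lt;br /&gt;
 UNION ALL SELECT 'ipos', COUNT(*) FROM ipos&lt;br /&gt;
 UNION ALL SELECT 'geo', COUNT(*) FROM geo&lt;br /&gt;
 UNION ALL SELECT 'firmbase', COUNT(*) FROM firmbase&lt;br /&gt;
 UNION ALL SELECT 'fundbase', COUNT(*) FROM fundbase;&lt;br /&gt;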
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to manually delete it. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
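Since this DELETE edits the base table by hand, it may be safer to run it inside a transaction so that it can be undone if it removes more rows than intended. A sketch of that pattern, reusing the same WHERE clause:&lt;br /&gt;
 --sketch: wrap the manual delete in a transaction&lt;br /&gt;
 BEGIN;&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
 --psql reports how many rows were removed (DELETE n); check that n is what you expect&lt;br /&gt;
 COMMIT;&lt;br /&gt;
 --if n is not what you expect, run ROLLBACK; instead of COMMIT;&lt;br /&gt;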
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
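The two counts can also be read side by side from a single query. This is a sketch rather than part of the original script; it relies on PostgreSQL accepting a row constructor inside COUNT(DISTINCT ...), and the column aliases totalrows and distinctkeys are made up for illustration.&lt;br /&gt;
 --the table is 1:1 on its key when both numbers are equal&lt;br /&gt;
 SELECT COUNT(*) AS totalrows,&lt;br /&gt;
        COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
The same pattern works for any core table by swapping in its table name and key columns.&lt;br /&gt;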
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries (1888 rows in the ipos table in total), look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
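A different technique, not the one used above, is to let PostgreSQL pick one row per key with DISTINCT ON. The sketch below keeps the row with the largest principalamt for each key; when two rows tie on principalamt it keeps an arbitrary one of them instead of dropping the key, so its row count can differ from the ipocore table built above. The table name ipocorealt is hypothetical.&lt;br /&gt;
 DROP TABLE IF EXISTS ipocorealt;&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 --one row per (issuer, issuedate, statecode); the ORDER BY decides which row survives&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;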
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.&lt;br /&gt;
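To see how the flag behaves at the boundaries, the same CASE expression can be run against a few made-up dates. The dates below are hypothetical and only illustrate the logic; the minimum announceddate is taken to be 2010-01-01 in every row.&lt;br /&gt;
 SELECT announceddate, minanndate,&lt;br /&gt;
 CASE WHEN announceddate - INTERVAL '7 day' &amp;gt; minanndate OR announceddate = minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
   (DATE '2010-01-01', DATE '2010-01-01'), --the minimum itself: dateflag 1&lt;br /&gt;
   (DATE '2010-01-05', DATE '2010-01-01'), --4 days after the minimum: dateflag 0&lt;br /&gt;
   (DATE '2010-01-08', DATE '2010-01-01'), --exactly 7 days after: dateflag 0&lt;br /&gt;
   (DATE '2010-01-09', DATE '2010-01-01')  --8 days after: dateflag 1&lt;br /&gt;
 ) AS t(announceddate, minanndate);&lt;br /&gt;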
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate &amp;gt; datefirstinv for a merger/acquisition, so the announceddate neither precedes nor equals the datefirstinv.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same number of rows as the matcherwarningmindates table, but it doesn't (366 vs 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
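In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the three-branch CASE above gives the same result as calling LEAST(announceddate, ipoissuedate) directly. The quick check below uses made-up dates to show this; it is an illustration, not part of the original script.&lt;br /&gt;
 SELECT LEAST(DATE '2012-03-01', NULL::date) AS masonly, --2012-03-01&lt;br /&gt;
        LEAST(NULL::date, DATE '2011-06-15') AS ipoonly, --2011-06-15&lt;br /&gt;
        LEAST(DATE '2012-03-01', DATE '2011-06-15') AS masandipo, --2011-06-15&lt;br /&gt;
        LEAST(NULL::date, NULL::date) AS neither; --NULL&lt;br /&gt;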
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the sum to the combined row counts of your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
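The same reconciliation can be done in one statement. This sketch only counts rows in tables built earlier on this page; the aliases keptplusoverlap and matchertotal are made up for illustration. The two numbers should be equal, provided no portco has an ipo and a merger on the same date (in that case both flags are 1 and the first number is larger).&lt;br /&gt;
 SELECT&lt;br /&gt;
   (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
   + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
   + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
      WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
   (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
   + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;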
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if they had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, then you want to exclude this entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
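Since the flags are filled in by hand in Excel, it can be worth recomputing them in SQL after the import and counting disagreements. This is a sketch under two assumptions not stated explicitly above: that excludeflag1 marks issuedate &amp;lt; datefirstinv and excludeflag2 marks issuedate = datefirstinv, and that the warning text is exactly Hall-Warning:Multiple.&lt;br /&gt;
 --a result of 0 means the imported flags agree with the recomputed ones&lt;br /&gt;
 SELECT COUNT(*) AS mismatches&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM (CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 IS DISTINCT FROM (CASE WHEN file2issuedate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag IS DISTINCT FROM (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;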
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
The table still has 11 more rows than distinct keys (43662 vs 43651). The query below shows that these fall under 8 duplicated keys, which we'll need to clean up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
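Before deleting anything by hand, it can help to see the full rows behind each duplicated key rather than only the key counts. A sketch of such a query against geo1, meant to be run before the DELETE statements above (it uses only columns that already appear in this section):&lt;br /&gt;
 --sketch: show every column of the rows whose key occurs more than once&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
  GROUP BY city, coname, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;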
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through the [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv from the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
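The extra rows come from portco keys that appear more than once in the matcher output, so each such key fans out in the LEFT JOIN. A sketch for listing the portco keys that are matched more than once (the same check on the geo side would group by geoconame, geocity, geodatefirstyear):&lt;br /&gt;
 --sketch: portco keys with more than one match in matcherportcogeo&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;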
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are on the [[Geocode.py]] page. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
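A sketch of exporting the addresses together with the portco key, assuming the address columns of companybasecore are what the geocoding script needs. The output file name portcoaddresses.txt is made up for illustration:&lt;br /&gt;
 \COPY (SELECT coname, statecode, datefirstinv, addr1, addr2, city, zip FROM companybasecore) TO 'portcoaddresses.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Keeping (coname, statecode, datefirstinv) on every exported line means the geocoded output can be joined straight back to companybasecore.&lt;br /&gt;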
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
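The CASE expression above can also be written with COALESCE, which returns its first non-null argument. This is a sketch, not the query used to build deaddate1; it assumes issuedate and announceddate are of type date, and casts the five-year fallback to date to match:&lt;br /&gt;
 --sketch: IPO date first, then M&amp;amp;A date, then datelastinv + 5 years&lt;br /&gt;
 SELECT coname, statecode, datefirstinv,&lt;br /&gt;
  COALESCE(issuedate, announceddate, (datelastinv + INTERVAL '5 YEAR')::date) AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;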
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
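The tempbase query relies on an allyears table with one row per calendar year. If it does not already exist, a sketch for building it with generate_series follows. The range 1980 to 2017 is an assumption and should be set to cover the years in your data:&lt;br /&gt;
 --sketch: one row per year; adjust the bounds to your data&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;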
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
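An alternative sketch that does the same exclusion in one step with a window function, counting rows per (coname, rounddate) and keeping only keys that occur once. The table name roundcorecheck is made up so that it does not overwrite roundcore, and the column list is the one used in the CREATE TABLE round query:&lt;br /&gt;
 --sketch: keep only rows whose (coname, rounddate) key occurs exactly once&lt;br /&gt;
 CREATE TABLE roundcorecheck AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage1, stage3, rndamtdisck, rndamtestk, roundnum, numinvestors&lt;br /&gt;
 FROM (SELECT *, COUNT(*) OVER (PARTITION BY coname, rounddate) AS keycount FROM round) AS r&lt;br /&gt;
 WHERE keycount = 1;&lt;br /&gt;
It should contain the same number of rows as roundcore, provided coname and rounddate are never NULL.&lt;br /&gt;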
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude financing for activities we are not interested in, which should therefore be left out of our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
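A quick sanity check on the flags, as a sketch. It assumes the three groups (growthflag, transactionflag, excludeflag) are meant to cover every stage3 value exactly once. The query lists any stage3 values that are unclassified or classified more than once, so ideally it returns no rows:&lt;br /&gt;
 --sketch: stage3 values that do not land in exactly one flag group&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag + transactionflag + excludeflag != 1&lt;br /&gt;
 GROUP BY stage3;&lt;br /&gt;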
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use (firmname, statecode, foundingdate) as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27097&lt;br /&gt;
You can see that fundname, firstinvdate is a good key.&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19573</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19573"/>
		<updated>2017-07-28T16:44:06Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, statecode, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of rows copied into each new table matches the line count in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table, because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between keys and entries: given an entry you will be able to create a unique key, and given a (coname, statecode, datefirstinv) key you will be able to find exactly 1 entry that the key corresponds to in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so 1 key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the entries for New York Digital Health LLC that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
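As an alternative to copying the table out, the two rows can be compared inside psql. A sketch using psql's expanded display (\x), which prints one column per line so that differing fields such as url are easy to spot:&lt;br /&gt;
 --sketch: view both rows for the duplicated key, one column per line&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT * FROM companybase1 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;
 \x off&lt;br /&gt;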
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and undisclosed company that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between (coname, statecode, datefirstinv) and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on.&lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in TextPad or Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
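The two counts can also be compared in a single statement. A minimal sketch using only the ipocore table and key columns from above (the aliases total_rows, distinct_keys and k are just illustrative names):&lt;br /&gt;
 --Both columns should show the same number when the key is unique&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM ipocore) AS total_rows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore) AS k) AS distinct_keys;&lt;br /&gt;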
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I added an id column to mas and kept the row with the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
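 --ids are read from mas (the table joined below) so masinclude.id is guaranteed to refer to the same rows&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;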
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys (targetname, targetstatecode, announceddate) some more. The first step is to remove keys for the same targetname and targetstatecode whose announceddate falls within a week of the earliest announceddate for that pair. Keep the key with the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week after the minimum announceddate for that (targetname, targetstatecode) pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.&lt;br /&gt;
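To see how the flag behaves, here is a small self-contained sketch that applies the same CASE expression to four made-up dates, assuming a minimum announceddate of 2010-01-01 (the dates and the aliases t and d are hypothetical, not taken from mas):&lt;br /&gt;
 SELECT d::date AS announceddate,&lt;br /&gt;
 CASE WHEN d::date - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d::date = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES ('2010-01-01'), ('2010-01-05'), ('2010-01-08'), ('2010-01-09')) AS t(d);&lt;br /&gt;
 --2010-01-01 gets 1 (it is the minimum), 2010-01-05 and 2010-01-08 get 0 (within 7 days), 2010-01-09 gets 1 (more than 7 days later)&lt;br /&gt;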
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag that is 1 if the announceddate &amp;lt; datefirstinv and a second exclude flag that is 1 if the datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
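The three flags can also be computed in SQL instead of Excel. This is only a sketch: it assumes the unflagged matcher output (the first seven columns above) has been loaded into a hypothetical staging table called matcherportcomasraw, which is not part of the original process.&lt;br /&gt;
 --matcherportcomasraw is a hypothetical staging table holding the raw matcher output&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2announceddate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2announceddate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw AS r;&lt;br /&gt;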
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is strictly after the datefirstinv (announceddate &amp;gt; datefirstinv).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
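The inner join result should have the same number of rows as the matcherwarningmindates table (364), but it has 366. This happens when a portco key matches more than one target on the same minimum announceddate. To find the dirty entries, use the query below.&lt;br /&gt;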
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table. The expected count is 8291 + 364 = 8655.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly 1 row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
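In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression above should be equivalent to a single LEAST call. A small sketch with made-up dates (the aliases a, i and t are illustrative):&lt;br /&gt;
 SELECT a AS announceddate, i AS ipoissuedate, LEAST(a, i) AS masterdate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2005-03-01', DATE '2004-11-15'),&lt;br /&gt;
  (DATE '2005-03-01', NULL),&lt;br /&gt;
  (NULL, DATE '2004-11-15'),&lt;br /&gt;
  (NULL, NULL)&lt;br /&gt;
 ) AS t(a, i);&lt;br /&gt;
 --masterdate for the four rows as listed: 2004-11-15, 2005-03-01, 2004-11-15, NULL&lt;br /&gt;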
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the sum to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 11005 = 8655 + 2350.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
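The same comparison can be done in one query. This sketch only uses tables built above; the aliases keytablesplusoverlap and matchercoretotal are illustrative. The two sums agree as long as no portco has an ipo and an ma on the same date, because such a row would get both valid flags.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keytablesplusoverlap,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchercoretotal;&lt;br /&gt;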
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], only the exit deal with the minimum date is kept, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts match the key tables and core tables built in the previous steps.&lt;br /&gt;
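A further check, sketched here with the same GROUP BY ... HAVING pattern used earlier, is to confirm that no portco key appears more than once in the master table. It should return no rows if the joins stayed 1:1.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;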
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, exclude the entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and ready for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 vs. 43651), spread over the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
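The table lists 8 duplicated keys, while geo1 has 43662 - 43651 = 11 more rows than distinct keys. The two figures can be reconciled with a short query. A sketch, run against geo1 as built above (the aliases n, d, duplicatedkeys and surplusrows are illustrative):&lt;br /&gt;
 --Each duplicated key contributes (count - 1) surplus rows&lt;br /&gt;
 SELECT COUNT(*) AS duplicatedkeys, SUM(n - 1) AS surplusrows&lt;br /&gt;
 FROM (SELECT COUNT(*) AS n FROM geo1 GROUP BY city, coname, startyear HAVING COUNT(*) &amp;gt; 1) AS d;&lt;br /&gt;
 --With the counts in the table above: 8 keys and 1+1+1+1+3+1+1+2 = 11 surplus rows&lt;br /&gt;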
Modify the geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors, as usual. If you simply import the output into vcdb2 and try joining companybasecore with geocore, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
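To see which keys cause the extra rows, a check along the following lines may help. It is only a sketch: it assumes the inflation comes from portco keys that appear more than once in matcherportcogeo, which the counts above suggest but do not prove.&lt;br /&gt;
 --portco keys matched to more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;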
So we need to fix this. Luckily I already had this data in another database, so I copied it out and imported it into vcdb2. The raw data is in a text file called geolookupold.txt in the folder on the Z drive.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
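A quick coverage check, sketched here under the assumption that an unmatched company has a NULL latitude after the LEFT JOIN, shows how many companies still lack geo data:&lt;br /&gt;
 --COUNT(latitude) skips NULLs, so the difference is the number of companies without coordinates&lt;br /&gt;
 SELECT COUNT(*) AS total, COUNT(latitude) AS withgeo, COUNT(*) - COUNT(latitude) AS missinggeo&lt;br /&gt;
 FROM companybasegeomaster;&lt;br /&gt;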
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to use the script. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
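A minimal sketch of such an address export is shown below. The table name addressesforgeocode and the addr1 filter are assumptions for illustration; the key (coname, statecode, datefirstinv) is the companybasecore key used elsewhere on this page.&lt;br /&gt;
 --hypothetical export table: one row per portco key with its address fields&lt;br /&gt;
 CREATE TABLE addressesforgeocode AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, addr1, addr2, city, zip&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 WHERE addr1 IS NOT NULL;&lt;br /&gt;
 \COPY addressesforgeocode TO 'addressesforgeocode.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;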
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate a deaddate: the IPO issuedate if there is one, otherwise the announceddate of the merger or acquisition, and if there is no exit at all then datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
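As a sanity check on the CASE logic, the following sketch counts companies by the exit type that determined their deaddate. The exittype labels are made up for illustration.&lt;br /&gt;
 --mirrors the precedence in deaddate1: IPO first, then merger/acquisition, otherwise no exit&lt;br /&gt;
 SELECT CASE&lt;br /&gt;
  WHEN issuedate IS NULL AND announceddate IS NULL THEN 'noexit'&lt;br /&gt;
  WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
  ELSE 'ma'&lt;br /&gt;
  END AS exittype, COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY 1;&lt;br /&gt;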
Run the queries below to build a master table with dead and alive flags, and then a count of the companies alive in each city, state and year (alivecount). A company counts as alive from the year of its datefirstinv through its deadyear.&lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
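The allyears table joined in tempbase above is not defined on this page. A minimal sketch of one way to build it is shown below, assuming it only needs a single integer column named year; the 1980 to 2017 range is a placeholder and should be adjusted to the span of the data. It would need to exist before tempbase is created.&lt;br /&gt;
 --hypothetical helper: one row per calendar year&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;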
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 --check that (coname, rounddate) is still a unique key after the joins&lt;br /&gt;
 SELECT coname, rounddate FROM roundplus&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 --remove the rows for the duplicated key&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two output tables below, which aggregate the rounds by city, state and year.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
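Following the pattern used for the other core tables, the key can be rechecked with the two counts below. This is a sketch; the counts should agree if (coname, rounddate) is unique in roundcore.&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;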
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that they can be removed from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
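To check that no stage3 value falls through all of the flags, a query like the sketch below lists values that are not covered by growthflag, transactionflag or excludeflag. It assumes every round should be covered by one of these three groups, which the page does not state explicitly.&lt;br /&gt;
 --stage3 values (including NULL) that no flag picks up&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;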
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Fund%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM fundbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --27097&lt;br /&gt;
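The next step would presumably be a key check like the one for firmbase. The sketch below assumes the key is (fundname, statecode, firstinvdate), as named in the Plan; the statecode and firstinvdate column names are not confirmed by the queries on this page and may differ in fundbase.&lt;br /&gt;
 --assumed key columns; compare the result with the 27097 count above&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, statecode, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;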
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19572</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19572"/>
		<updated>2017-07-28T16:34:14Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the row counts copied into your new tables match the line counts given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
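One way to check the loaded row counts is sketched below, under the assumption that the load scripts create tables named roundbase, ipos, firmbase and fundbase (the names used later on this page).&lt;br /&gt;
 --compare each count with the number of data lines in the matching txt file&lt;br /&gt;
 SELECT 'roundbase' AS tablename, COUNT(*) FROM roundbase&lt;br /&gt;
 UNION ALL SELECT 'ipos', COUNT(*) FROM ipos&lt;br /&gt;
 UNION ALL SELECT 'firmbase', COUNT(*) FROM firmbase&lt;br /&gt;
 UNION ALL SELECT 'fundbase', COUNT(*) FROM fundbase;&lt;br /&gt;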
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look at it in a text editor such as TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
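As an alternative to exporting the table, the two candidate rows can be compared inside psql before running the DELETE. This is only a sketch of that inspection step; \x is psql's expanded display toggle, which prints one column per line.&lt;br /&gt;
 \x on&lt;br /&gt;
 SELECT * FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = '2015-08-13';&lt;br /&gt;
 \x off&lt;br /&gt;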
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look at the export in TextPad or Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamts. For each key, keep the row with the MAX principalamt and throw out the rows with a lower principalamt. The query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
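The same clean-up could be done in one statement with DELETE ... USING, sketched below. Note the differences from the manual statements: it matches on the full key including issuedate, so it removes only the exact duplicated keys, and keys with a NULL in any key column will not match.&lt;br /&gt;
 --remove every key listed in ipoduplicates2 from ipoinclude&lt;br /&gt;
 DELETE FROM ipoinclude AS inc&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE inc.issuer = d.issuer AND inc.issuedate = d.issuedate AND inc.statecode = d.statecode;&lt;br /&gt;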
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
Great! The counts don't match so we'll have to clean the mas table. There is no obvious field to filter against with mas. So I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 --add the id to mas first so that mas and mas1 share the same ids&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week after the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or when it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum announceddate for the targetname, targetstatecode pair by 1 week or less, then it is 0.    &lt;br /&gt;
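A quick way to see how many keys fall into each group is to count by dateflag. This is a sketch of a sanity check; the two counts should add up to the 114825 rows in maskeysdatewindow.&lt;br /&gt;
 SELECT dateflag, COUNT(*) AS numkeys&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 GROUP BY dateflag&lt;br /&gt;
 ORDER BY dateflag;&lt;br /&gt;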
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add one exclude flag that is 1 if the announceddate &amp;lt; datefirstinv and another exclude flag that is 1 if the datefirstinv = announceddate. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
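The flags can also be computed in SQL instead of Excel. The sketch below assumes the seven unflagged matcher columns were first loaded into a table called matcherportcomasraw; that table name is hypothetical and not part of the process above.&lt;br /&gt;
 --matcherportcomasraw is an assumed staging table holding the first seven columns of matcherportcomas&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2announceddate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2announceddate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw AS r;&lt;br /&gt;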
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate of the merger/acquisition is after the datefirstinv (announceddate &amp;gt; datefirstinv).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join row count should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If a portco key matches more than 1 mas key you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. Each should be 1:1 on its own key, meaning one key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to mark the rows that have an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
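If an ipo and an ma share the same date for one portco, both flags will be 1 and that portco will end up in both key tables. A sketch of a check for that case, using only the columns built above:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;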
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined number of keys in your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
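It is also worth confirming that the master table is still 1:1 on the portco key. A minimal check, which should return the same 44740 as the row count of the table:&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;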
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, you will want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 extra rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
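To see which portco keys were multiplied by the join, something like the following can be used. It is a sketch that only groups the joined table by the portco key.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*) AS numrows&lt;br /&gt;
 FROM companybasecorejoingeo&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY numrows DESC;&lt;br /&gt;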
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
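Because this is a LEFT JOIN, companies without a matching row in geoimport keep NULL coordinates. A sketch of a query to count them:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasegeomaster&lt;br /&gt;
 WHERE latitude IS NULL OR longitude IS NULL;&lt;br /&gt;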
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to do this. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
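To see how the deaddate was assigned, the rows can be split by exit type. This is a sketch that mirrors the CASE logic above (ipo date first, then ma date, otherwise datelastinv + 5 years); the three counts should add up to 44740.&lt;br /&gt;
 SELECT&lt;br /&gt;
 SUM(CASE WHEN issuedate IS NOT NULL THEN 1 ELSE 0 END) AS ipoexits,&lt;br /&gt;
 SUM(CASE WHEN issuedate IS NULL AND announceddate IS NOT NULL THEN 1 ELSE 0 END) AS maexits,&lt;br /&gt;
 SUM(CASE WHEN issuedate IS NULL AND announceddate IS NULL THEN 1 ELSE 0 END) AS noexits&lt;br /&gt;
 FROM deaddate1;&lt;br /&gt;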
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
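 --check that coname, rounddate is a unique key in roundplus; this lists keys that occur more than once&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;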
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that these entries can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname, closedate, firstinvdate FROM fundbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --27571&lt;br /&gt;
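The same duplicate-key check used for the other tables can list the fund keys that occur more than once. This is only a sketch, assuming (fundname, closedate, firstinvdate) is the intended key for the fund table:&lt;br /&gt;
 SELECT fundname, closedate, firstinvdate, COUNT(*)&lt;br /&gt;
 FROM fundbase1 WHERE undisclosedflag = 0&lt;br /&gt;
 GROUP BY fundname, closedate, firstinvdate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;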
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19569</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19569"/>
		<updated>2017-07-27T22:16:33Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
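The two rows can also be compared inside psql instead of a text editor. A minimal sketch, where the choice of columns to display is only a suggestion:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, url, updateddate&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY';&lt;br /&gt;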
Manually delete this record from the roundbase table using the command below. After that we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
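The DELETE statements above were written by hand. As a sketch of an alternative, assuming ipoduplicates2 still holds exactly the leftover duplicate keys, a single statement could remove the same issuers from ipoinclude. It matches on issuer and statecode only, as the statements above do. Note that = never matches a NULL statecode, so such keys would be skipped:&lt;br /&gt;
 --remove every ipoinclude row whose issuer and statecode appear in ipoduplicates2&lt;br /&gt;
 DELETE FROM ipoinclude&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE ipoinclude.issuer = d.issuer AND ipoinclude.statecode = d.statecode;&lt;br /&gt;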
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas. So I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 --add the id to mas first so that the copy mas1 carries the same ids&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
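PostgreSQL's DISTINCT ON offers another way to keep one row per key. The sketch below is an alternative to the masinclude join, not the method used above. It assumes the id column has already been added to mas and keeps the row with the lowest id for each key:&lt;br /&gt;
 --one row per (targetname, targetstatecode, announceddate), lowest id first&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;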
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate, or when it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is after the minimum but no more than 1 week after it, the dateflag is 0.&lt;br /&gt;
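To see the rule on concrete values, the sketch below applies the same CASE expression to four made-up announcement dates with a made-up minimum date of 2010-03-01. It does not touch any table in vcdb2:&lt;br /&gt;
 --expected dateflag: 2010-03-01 -&amp;gt; 1 (the minimum), 2010-03-04 -&amp;gt; 0, 2010-03-08 -&amp;gt; 0 (exactly 7 days later), 2010-03-20 -&amp;gt; 1&lt;br /&gt;
 SELECT d.announceddate, m.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; m.minanndate OR d.announceddate = m.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-03-01'), (DATE '2010-03-04'), (DATE '2010-03-08'), (DATE '2010-03-20')) AS d(announceddate)&lt;br /&gt;
 CROSS JOIN (VALUES (DATE '2010-03-01')) AS m(minanndate);&lt;br /&gt;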
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add an exclude flag that is set if the announceddate &amp;lt; datefirstinv and a second exclude flag that is set if the datefirstinv = announceddate. Also add a warning flag that is set if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
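The three flags were added in Excel. A hedged sketch of the same rules in SQL is shown here, assuming the raw matcher output had instead been loaded into a hypothetical table matcherraw holding only the first seven columns above. Which exclude flag gets which rule is also an assumption:&lt;br /&gt;
 --matcherraw is a hypothetical table; it is not created anywhere on this page&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherraw;&lt;br /&gt;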
You've imported 9,645 matches into your matcher table in vcdb2. The query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is not before the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
For each portco key with a multiple-match warning, select the minimum announceddate among its matched mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
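The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. The extra rows come from keys that have two matches sharing the same minimum announceddate. To find these dirty entries, use the query below.&lt;br /&gt;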
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
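The jump from 364 keys to 366 rows in the inner join above is what you get when a key has two matches on the same minimum date. Here is a minimal sketch of that effect in Python with an in-memory SQLite table. The table name matches, the shortened column names and all rows are made up for illustration and are not part of vcdb2.&lt;br /&gt;

```python
import sqlite3

# Hypothetical toy rows: (coname, statecode, datefirstinv, targetname, announceddate).
ROWS = [
    ("Alpha Inc", "TX", "2001-01-01", "Alpha Corp", "2005-06-01"),
    ("Alpha Inc", "TX", "2001-01-01", "Alpha Holdings", "2007-03-15"),
    ("Beta LP", "CO", "2004-12-23", "Beta LP", "2010-02-02"),
    ("Beta LP", "CO", "2004-12-23", "Beta GP LLC", "2010-02-02"),  # tie on the minimum date
]


def earliest_matches(rows):
    """Return the rows that carry the earliest announceddate for their key."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE matches (coname TEXT, statecode TEXT, datefirstinv TEXT,"
        " targetname TEXT, announceddate TEXT)"
    )
    con.executemany("INSERT INTO matches VALUES (?, ?, ?, ?, ?)", rows)
    # Same shape as the mindates + include queries: group to the minimum date,
    # then join back on the key plus that date.
    result = con.execute(
        """
        SELECT m.coname, m.targetname, m.announceddate
        FROM matches AS m
        JOIN (SELECT coname, statecode, datefirstinv, MIN(announceddate) AS mindate
              FROM matches
              GROUP BY coname, statecode, datefirstinv) AS mi
          ON m.coname = mi.coname AND m.statecode = mi.statecode
         AND m.datefirstinv = mi.datefirstinv AND m.announceddate = mi.mindate
        ORDER BY m.coname, m.targetname
        """
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in earliest_matches(ROWS):
        print(row)
```

Two keys go in but three rows come out, because both Beta LP matches share the minimum date. Those are the rows you have to resolve by hand.&lt;br /&gt;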
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables, i.e. each key identifies exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because some portcokeys match both a maskey and an ipokey. Some companies have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want each portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Finally we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas and to ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below (the portcokeys that matched both an ipokey and a maskey, each of which lost one of its two keys) and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 8655 + 2350 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
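If you want to sanity check the masterdate and flag logic on a few rows before running it on the whole table, here is a minimal sketch in Python. The function name exit_flags is made up for illustration. It mirrors the two CASE expressions above for a single portcokey, with None standing in for NULL.&lt;br /&gt;

```python
from datetime import date
from typing import Optional, Tuple


def exit_flags(announceddate: Optional[date],
               ipoissuedate: Optional[date]) -> Tuple[Optional[date], int, int]:
    """Return (masterdate, maskeyvalid, ipokeyvalid) for one portco key."""
    dates = [d for d in (announceddate, ipoissuedate) if d is not None]
    # masterdate is the earliest exit date that is present, or None if there is no exit.
    masterdate = min(dates) if dates else None
    # A key is valid only when its date is present and equals the masterdate.
    maskeyvalid = int(announceddate is not None and announceddate == masterdate)
    ipokeyvalid = int(ipoissuedate is not None and ipoissuedate == masterdate)
    return masterdate, maskeyvalid, ipokeyvalid


if __name__ == "__main__":
    print(exit_flags(date(2010, 5, 1), date(2008, 3, 1)))  # ipo came first
    print(exit_flags(date(2010, 5, 1), None))              # ma only
    print(exit_flags(None, None))                          # no exit
```

Note that if both dates are equal, both flags come out as 1 and the portcokey would land in both key tables. The count check above (8610 + 2312 + 83 = 8655 + 2350) only balances if no such ties exist, so it did not happen in this data, but it is worth checking after a fresh pull.&lt;br /&gt;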
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the key tables keep only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
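The Excel flagging step at the start of this section can also be scripted. Here is a minimal sketch in Python, not the method actually used for vcdb2. The function name matcher_flags is made up, and the assignment of the &amp;quot;before&amp;quot; test to excludeflag1 and the &amp;quot;equal&amp;quot; test to excludeflag2 is an assumption based on the order in which the flags are described. exitdate stands for the issuedate here (or the announceddate for the mas matches).&lt;br /&gt;

```python
from datetime import date

MULTIPLE_WARNING = "Hall-Warning:Multiple"


def matcher_flags(warning: str, datefirstinv: date, exitdate: date) -> dict:
    """Compute the three flags for one row of matcher output."""
    return {
        # Assumed to be excludeflag1: exit dated before the first investment.
        "excludeflag1": int(datefirstinv > exitdate),
        # Assumed to be excludeflag2: exit dated on the day of the first investment.
        "excludeflag2": int(exitdate == datefirstinv),
        # 1 when the Matcher reported multiple candidate matches for the key.
        "warningflag": int(warning.strip() == MULTIPLE_WARNING),
    }


if __name__ == "__main__":
    print(matcher_flags("", date(2005, 1, 1), date(2010, 1, 1)))
    print(matcher_flags(MULTIPLE_WARNING, date(2005, 1, 1), date(2004, 1, 1)))
```

A row is a &amp;quot;good&amp;quot; match when all three values are 0, which is the same filter the queries above use.&lt;br /&gt;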
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
Removing the complete duplicates still leaves 11 surplus rows (43662 rows vs. 43651 distinct keys), spread over the 8 keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the latitude and longitude of each of these rows (note that the latitude column is spelled lattitude in the geo table) and delete one row for each duplicated key. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into vcdb2 and try joining companybasecore with geocore through it, your row counts will start to explode. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
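First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables, as well as the companybasekeyipokeycore and companybasekeymaskeycore tables. The deaddate is the exit date (issuedate or announceddate) if the company had an exit. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;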
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
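The deaddate rule and the year expansion in tempbase can be sketched in Python for a single company. This is only an illustration. The names add_years, dead_date and alive_years are made up, and the sketch ignores the range of years in the allyears table. The Feb 29 handling is an assumption meant to mirror how I understand PostgreSQL interval addition to clamp to the end of the month.&lt;br /&gt;

```python
from datetime import date
from typing import List, Optional


def add_years(d: date, years: int) -> date:
    """Add whole years; Feb 29 falls back to Feb 28 if the target year is not a leap year."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def dead_date(datelastinv: date, issuedate: Optional[date],
              announceddate: Optional[date]) -> date:
    """Exit date if there is one, otherwise datelastinv + 5 years."""
    if issuedate is not None:
        return issuedate
    if announceddate is not None:
        return announceddate
    return add_years(datelastinv, 5)


def alive_years(datefirstinv: date, deaddate: date) -> List[int]:
    """Every calendar year from the first investment through the dead year, inclusive."""
    return list(range(datefirstinv.year, deaddate.year + 1))


if __name__ == "__main__":
    dd = dead_date(date(2010, 6, 30), None, None)
    print(dd, alive_years(date(2008, 3, 1), dd))
```

Each year in the list corresponds to one row for that company in tempbase, which is what alivecount then counts per city, statecode and year.&lt;br /&gt;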
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
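Because the amounts are summed as roundamtm times a flag, a round with no disclosed or estimated amount (NULL roundamtm) adds nothing to the amount columns but still counts toward numseeds, numearly and so on. The sketch below illustrates this on a hypothetical temporary table, demo_roundplus, with made-up rows; it is not part of vcdb2.&lt;br /&gt;

```sql
-- Hypothetical toy data standing in for roundplus (assumed subset of columns).
CREATE TEMP TABLE demo_roundplus (city text, statecode text, roundyear int, roundamtm real, seedflag int, earlyflag int);
INSERT INTO demo_roundplus VALUES
  ('Houston', 'TX', 2010, 2.5,  1, 0),
  ('Houston', 'TX', 2010, 4.0,  0, 1),
  ('Houston', 'TX', 2010, NULL, 1, 0);  -- seed round with no amount

SELECT city, statecode, roundyear AS year,
  sum(roundamtm*seedflag) AS seedamtm,   -- NULL products are skipped by sum
  sum(roundamtm*earlyflag) AS earlyamtm,
  sum(seedflag) AS numseeds,             -- counts every seed round, with or without an amount
  sum(earlyflag) AS numearly
FROM demo_roundplus
GROUP BY city, statecode, roundyear;
-- One row: seedamtm = 2.5, earlyamtm = 4, numseeds = 2, numearly = 1
```

So numseeds can be larger than the number of rounds that actually contribute to seedamtm.&lt;br /&gt;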
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
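The same exclusion can be sketched in a single pass with a window function, which avoids the intermediate roundexclude table. The example below is only an illustration on a hypothetical temporary table, demo_round, with made-up rows; it is not one of the queries used to build vcdb2.&lt;br /&gt;

```sql
-- Hypothetical toy data; demo_round stands in for the round table.
CREATE TEMP TABLE demo_round (coname text, rounddate date, stage3 text);
INSERT INTO demo_round VALUES
  ('Alpha Inc', DATE '2010-01-15', 'Seed'),
  ('Alpha Inc', DATE '2011-06-01', 'Early Stage'),
  ('Beta LLC',  DATE '2012-03-10', 'Seed'),
  ('Beta LLC',  DATE '2012-03-10', 'Later Stage');  -- same key twice

-- Count rows per key with a window function and keep keys seen exactly once.
SELECT coname, rounddate, stage3
FROM (
  SELECT *, COUNT(*) OVER (PARTITION BY coname, rounddate) AS keycount
  FROM demo_round
) AS t
WHERE keycount = 1;
-- Both Beta LLC rows are dropped; the two Alpha Inc rows remain.
```

As in the process above, every row of a duplicated key is excluded, not just the extra copies.&lt;br /&gt;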
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The transactionflag marks acquisition, bridge loan, expansion, recapitalization and mezzanine rounds. The excludeflag marks financing for activities we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that it can be excluded from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout', 'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
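Any stage3 value that is not listed in one of the CASE expressions above gets 0 for every flag. A quick way to spot such values is sketched below on a hypothetical temporary table, demo_stageflags, with made-up rows; the same WHERE clause could be run against stageflags itself.&lt;br /&gt;

```sql
-- Hypothetical toy table standing in for stageflags (assumed subset of columns).
CREATE TEMP TABLE demo_stageflags (stage3 text, growthflag int, transactionflag int, excludeflag int);
INSERT INTO demo_stageflags VALUES
  ('Seed',          1, 0, 0),
  ('PIPE',          0, 0, 1),
  ('Bridge Loan',   0, 1, 0),
  ('Unknown Stage', 0, 0, 0);  -- not covered by any CASE branch

-- List stage3 values that are not assigned to exactly one of the three groups.
SELECT stage3, COUNT(*) AS numrounds
FROM demo_stageflags
WHERE growthflag + transactionflag + excludeflag != 1
GROUP BY stage3
ORDER BY stage3;
-- Returns only 'Unknown Stage' for the toy data.
```

Rows with a NULL stage3 would show up in the same way, because every CASE expression falls through to 0 for them.&lt;br /&gt;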
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that the key is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
First flag the undisclosed funds.&lt;br /&gt;
 CREATE TABLE fundbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN fundname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM fundbase;&lt;br /&gt;
 --27588&lt;br /&gt;
&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19565</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19565"/>
		<updated>2017-07-27T21:38:29Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning firmbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase.&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by adding flags for undisclosed companies and US location. Check that the key (coname, statecode, datefirstinv) is valid. Remove duplicates manually or with an update command against roundbase. Check that the round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check that the key (fundname, statecode, firstinvdate) is valid.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check that the key (firmname, foundingdate) is valid.&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and the data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count given in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find out what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise dirty keys will carry over into everything else you build off roundbase. Once the roundbase table is clean we rename it roundbasecore, so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, meaning there is a 1:1 relationship between keys and entries. Given an entry you can build its key, and given a coname, statecode, datefirstinv key you can find exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
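The two counts can also be compared in a single query. The sketch below assumes PostgreSQL, which accepts a row expression inside COUNT(DISTINCT ...), and uses a hypothetical temporary table, demo_companybase, with made-up rows.&lt;br /&gt;

```sql
-- Hypothetical toy data standing in for companybase (assumed subset of columns).
CREATE TEMP TABLE demo_companybase (coname text, statecode text, datefirstinv date, url text);
INSERT INTO demo_companybase VALUES
  ('Example Health LLC',  'NY', DATE '2015-08-13', 'www.example.com'),
  ('Example Health LLC',  'NY', DATE '2015-08-13', 'www.example.org'),
  ('Sample Robotics Inc', 'TX', DATE '2012-01-05', 'www.example.net');

-- Total rows next to the number of distinct keys; the key is valid only if they are equal.
SELECT COUNT(*) AS numrows,
       COUNT(DISTINCT (coname, statecode, datefirstinv)) AS numkeys
FROM demo_companybase;
-- numrows = 3, numkeys = 2, so the key is not unique in the toy table
```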
Some of the data in the companybase table belongs to undisclosed companies and to companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look at it in TextPad for something that distinguishes one of the two New York Digital Health LLC entries, so that we can delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
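Instead of copying the table out, the full rows behind a duplicated key can also be listed directly in psql. The sketch below shows the idea on a hypothetical temporary table, demo_companydups, with made-up rows; the table and its values are assumptions for illustration only.&lt;br /&gt;

```sql
-- Hypothetical toy data standing in for companybase1 (assumed subset of columns).
CREATE TEMP TABLE demo_companydups (coname text, statecode text, datefirstinv date, url text);
INSERT INTO demo_companydups VALUES
  ('Example Health LLC',  'NY', DATE '2015-08-13', 'www.example.com'),
  ('Example Health LLC',  'NY', DATE '2015-08-13', 'www.example.org'),
  ('Sample Robotics Inc', 'TX', DATE '2012-01-05', 'www.example.net');

-- Join the table back to its duplicated keys to see every column of the offending rows.
SELECT c.*
FROM demo_companydups AS c
JOIN (
  SELECT coname, statecode, datefirstinv
  FROM demo_companydups
  GROUP BY coname, statecode, datefirstinv
  HAVING COUNT(*) > 1
) AS dup USING (coname, statecode, datefirstinv)
ORDER BY c.coname, c.url;
-- Returns the two Example Health LLC rows, which differ only in url.
```

Note that a key with a NULL in any of its columns never matches in this join, so such rows would have to be checked separately.&lt;br /&gt;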
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags (US nationcode and undisclosed company) that we built in the companybase1 table. Once the duplicate record has been deleted from roundbase, this table has a 1:1 relationship between the key (coname, statecode, datefirstinv) and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which add up to 1888 rows in the ipos table, we look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamt values. Let's keep the row with the MAX principalamt for each key and throw out the rows with a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
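PostgreSQL's DISTINCT ON offers another way to get one row per key, sketched below on a hypothetical temporary table, demo_ipos, with made-up rows. Note that this is a different choice from the process above: when two rows tie on principalamt it keeps one of them arbitrarily, whereas the process above deletes such keys altogether.&lt;br /&gt;

```sql
-- Hypothetical toy data standing in for ipos (assumed subset of columns).
CREATE TEMP TABLE demo_ipos (issuer text, issuedate date, statecode text, principalamt numeric);
INSERT INTO demo_ipos VALUES
  ('Gamma Corp', DATE '1999-05-04', 'NY', 100),
  ('Gamma Corp', DATE '1999-05-04', 'NY', 250),
  ('Delta Fund', DATE '1994-09-20', 'FL', 80),
  ('Delta Fund', DATE '1994-09-20', 'FL', 80);  -- tie on principalamt

-- Keep exactly one row per key, preferring the largest principalamt.
SELECT DISTINCT ON (issuer, issuedate, statecode) *
FROM demo_ipos
ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;
-- Two rows: Delta Fund with 80 and Gamma Corp with 250.
```

Whichever approach is used, the final check is the same: the row count of the core table must equal its count of distinct keys.&lt;br /&gt;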
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I added an id column to mas and kept the MIN id for each duplicate key.&lt;br /&gt;
 --add the id to mas first so that the copy mas1 carries the same id values&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore should match the total row count of the table, which means the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
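To make the window rule concrete, the sketch below applies the same CASE expression to a hypothetical temporary table, demo_maskeys, with made-up dates. It computes the minimum date with a window function instead of the separate maskeysmindates table, which is a substitution made only to keep the example short.&lt;br /&gt;

```sql
-- Hypothetical toy data standing in for maskeys.
CREATE TEMP TABLE demo_maskeys (targetname text, targetstatecode text, announceddate date);
INSERT INTO demo_maskeys VALUES
  ('Epsilon Co', 'TX', DATE '2005-03-01'),  -- earliest date for the pair
  ('Epsilon Co', 'TX', DATE '2005-03-04'),  -- 3 days after the earliest date
  ('Epsilon Co', 'TX', DATE '2005-03-08'),  -- exactly 7 days after
  ('Epsilon Co', 'TX', DATE '2006-01-10');  -- well outside the window

SELECT targetname, targetstatecode, announceddate,
  announceddate - MIN(announceddate) OVER w AS daysaftermin,
  CASE WHEN announceddate - INTERVAL '7 day' > MIN(announceddate) OVER w
         OR announceddate = MIN(announceddate) OVER w THEN 1::int
       ELSE 0::int
  END AS dateflag
FROM demo_maskeys
WINDOW w AS (PARTITION BY targetname, targetstatecode)
ORDER BY announceddate;
-- dateflag is 1, 0, 0, 1 for the four rows, in date order.
```

The comparison is always against the earliest date for the pair, so two later announcements that are close to each other are both kept.&lt;br /&gt;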
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add an exclude flag for matches where the announceddate &amp;lt; datefirstinv and another exclude flag for matches where the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning, where the announceddate of the merger/acquisition is not earlier than the datefirstinv and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
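The three flags are added in Excel in the step above. As a hedged alternative, they could be derived in SQL after importing the raw matcher output; the sketch below shows the idea on a hypothetical temporary table, demo_matches, with made-up rows and only a subset of the columns.&lt;br /&gt;

```sql
-- Hypothetical toy data standing in for the raw matcher output (no flag columns yet).
CREATE TEMP TABLE demo_matches (
  warning varchar(100),
  file1coname varchar(100),
  file1datefirstinv date,
  file2targetname varchar(100),
  file2announceddate date
);
INSERT INTO demo_matches VALUES
  ('',                      'Zeta Inc',  DATE '2001-02-01', 'Zeta Inc',  DATE '2006-07-15'),
  ('',                      'Eta Corp',  DATE '2004-05-20', 'Eta Corp',  DATE '2002-01-10'),
  ('',                      'Theta LLC', DATE '2003-03-03', 'Theta LLC', DATE '2003-03-03'),
  ('Hall-Warning:Multiple', 'Iota Inc',  DATE '1999-09-09', 'Iota Inc',  DATE '2005-11-30');

SELECT *,
  -- announced before the first investment
  CASE WHEN file1datefirstinv > file2announceddate THEN 1::int ELSE 0::int END AS excludeflag1,
  -- announced on the day of the first investment
  CASE WHEN file1datefirstinv = file2announceddate THEN 1::int ELSE 0::int END AS excludeflag2,
  -- a NULL or empty warning yields 0
  CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag
FROM demo_matches;
-- Only the Zeta Inc row has all three flags equal to 0.
```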
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table, but it doesn't (366 versus 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. They should be 1:1 on themselves, meaning that 1 key matches exactly 1 row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open the file in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
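If an ipo and an ma for the same portco carry the same date, both flags above are set to 1 and that portco would land in both key tables. A minimal check, assuming only the flag table built above (a result of 0 means no such ties exist):&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;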
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
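The same reconciliation can be done in one statement. This is a sketch that assumes all five tables from the steps above exist; the column aliases keptplusoverlap and matchertotal are illustrative names, not part of the original build. With the counts reported above both columns would be 11005.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
     WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;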
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
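One further check that could be added, assuming the portco key (coname, statecode, datefirstinv) is meant to stay unique in the master table: the distinct key count should equal the row count of 44740 reported above.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;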
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one for rows where issuedate &amp;lt; datefirstinv and one for rows where issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table, since only matches with issuedate &amp;gt; datefirstinv are wanted. If the flags are set to 0, keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
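As a cross-check on the Excel flags, the same logic can be recomputed in SQL after the import. This is a sketch, not the method used to build the file: it assumes excludeflag1 marks an issuedate before datefirstinv and excludeflag2 marks equal dates (the text does not say which flag is which), and the sql-prefixed column names are illustrative.&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, excludeflag1, excludeflag2, warningflag,&lt;br /&gt;
 CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS sqlexcludeflag1,&lt;br /&gt;
 CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS sqlexcludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS sqlwarningflag&lt;br /&gt;
 FROM matcherportcoipo;&lt;br /&gt;
Rows where a stored flag differs from its recomputed value are worth a second look.&lt;br /&gt;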
You can see the &amp;quot;good&amp;quot; matches by requiring all the flags to be 0, as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 8 keys that are not distinct, which account for the 11 extra rows (43662 - 43651). We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
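To inspect the complete rows behind those keys before deleting anything, a query along these lines could be used. It is a sketch that only assumes the geo1 table above; rows with a NULL in any key column would not be matched by the join.&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
       GROUP BY city, coname, startyear&lt;br /&gt;
       HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;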
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import the output into vcdb2 and try joining companybasecore with geocore through it, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
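That is 278 more rows than companybasecore has. A sketch for locating the fan-out, assuming it arises from portco keys that appear more than once in matcherportcogeo (the other join, to geocore, is on a key already verified as unique):&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;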
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions for the script are on the [[Geocode.py]] page. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back with the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
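The tempbase query above reads from an allyears table that is not defined in this section. If it does not already exist, one way to build it is sketched below, under the assumption that allyears holds one row per calendar year in a single integer column named year. The range is taken from deadalive1 rather than from any documented source.&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(&lt;br /&gt;
  (SELECT MIN(aliveyear) FROM deadalive1)::int,&lt;br /&gt;
  (SELECT MAX(deadyear) FROM deadalive1)::int) AS year;&lt;br /&gt;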
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
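To confirm that (coname, rounddate) is now a valid key, the distinct key count can be compared with the row count. A minimal sketch using only roundcore; a result equal to the 143000 rows reported above indicates the key is unique.&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;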
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks rounds that financed activities we are not interested in, which should therefore be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
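One way to confirm that the flags cover every stage3 value is to list the values for which no flag is set. This is only a sketch against the stageflags table built above; any rows it returns are stage3 values (or NULLs) that none of the CASE expressions match.&lt;br /&gt;
 --stage3 values that are not growth, transaction or exclude&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;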
==Cleaning firmbase==&lt;br /&gt;
The firmbase table contains undisclosed firms. Add a flag and remove them. Then use firmname, statecode, foundingdate as the key for this table. Check that it is valid and make your core table.&lt;br /&gt;
 CREATE TABLE firmbase1 AS&lt;br /&gt;
 SELECT *, CASE&lt;br /&gt;
 WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int&lt;br /&gt;
 ELSE 0::int END AS undisclosedflag&lt;br /&gt;
 FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, foundingdate FROM firmbase1 WHERE undisclosedflag = 0)a;&lt;br /&gt;
 --14145&lt;br /&gt;
 &lt;br /&gt;
 CREATE TABLE firmbasecore AS&lt;br /&gt;
 SELECT * FROM firmbase1 WHERE undisclosedflag = 0;&lt;br /&gt;
 --14145&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19562</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19562"/>
		<updated>2017-07-27T21:11:35Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning firmbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count noted in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the command below. Once that is done we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
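The two counts can also be computed in a single query, which makes the comparison harder to get wrong. This is a sketch rather than part of the original build script; the column aliases totalrows and distinctkeys are made up for illustration. The key is valid when the two columns are equal.&lt;br /&gt;
 --row count and distinct key count side by side&lt;br /&gt;
 SELECT COUNT(*) AS totalrows,&lt;br /&gt;
 COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;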
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys, and this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
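As an alternative to the MAX/JOIN approach above, PostgreSQL's DISTINCT ON can keep one row per key in a single step. This is a sketch, not the method used to build vcdb2: the table name ipocorealt is hypothetical, and when two rows share a key and the same principalamt the row that is kept is arbitrary, whereas the process above drops such keys entirely. The result would still need the key count check.&lt;br /&gt;
 --keep, for each key, the row with the largest principalamt&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;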
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 --add the id to mas first so that mas1 is copied with the same id values&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is itself the minimum announceddate. If the announceddate is later than the minimum by 7 days or less, the dateflag is 0.&lt;br /&gt;
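To see how the CASE expression behaves at the boundaries, the same logic can be run against a few made-up dates. This is an illustrative sketch only: the dates and the column names d and mind are invented for the example and do not come from the mas table.&lt;br /&gt;
 SELECT d AS announceddate, mind AS minanndate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; mind OR d = mind THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-01', DATE '2010-01-01'), --the minimum date itself: dateflag 1&lt;br /&gt;
  (DATE '2010-01-05', DATE '2010-01-01'), --4 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-08', DATE '2010-01-01'), --exactly 7 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-09', DATE '2010-01-01')  --8 days later: dateflag 1&lt;br /&gt;
 ) AS t(d, mind);&lt;br /&gt;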
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2 but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate of the merger/acquisition is later than the datefirstinv (so neither before it nor equal to it).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
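Before rescuing matches it can help to see where the discarded rows come from. The query below is a sketch that only assumes the three flag columns defined in matcherportcomas above; it counts the matches for each combination of flags, and the row with all three flags at 0 should equal the count of good matches.&lt;br /&gt;
 --number of matches for each combination of exclude and warning flags&lt;br /&gt;
 SELECT excludeflag1, excludeflag2, warningflag, COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 GROUP BY excludeflag1, excludeflag2, warningflag&lt;br /&gt;
 ORDER BY excludeflag1, excludeflag2, warningflag;&lt;br /&gt;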
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should equal the number of rows in the matcherwarningmindates table (364) but it returns 366. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the number of ipo and mas matches you started with. Each of the 83 portcos that matched both an ipo and a ma keeps only its earlier exit, so the two core tables together should be short by exactly 83 rows. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
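One more thing worth checking, as a sketch rather than part of the original process: if a portco's ipoissuedate and announceddate fall on the same day, both flags are set to 1 and the portco lands in both core tables. The query below counts those ties. If it returns anything other than 0 you will need to decide by hand which exit to keep:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 -- both flags are 1 only when the ipo date and the ma date are identical&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;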
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the date and amount of that exit. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the key tables keep only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
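A more direct check for duplicates, sketched here as an extra step, is to group the master table by the portco key. If the joins were clean this returns no rows. Any row it does return is a key that fanned out in one of the joins:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;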
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: the first is 1 where the issuedate &amp;lt; datefirstinv and the second is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you want to exclude this entry from your table, since the ipo of a portco should come after its first investment. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you will want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
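If you would rather not build the flags in Excel, the same three flags can be computed in SQL. This is only a sketch: matcherportcoiporaw is a hypothetical table that holds the seven raw Matcher columns (warning through file2issuedate) loaded without the flag columns.&lt;br /&gt;
 -- matcherportcoiporaw is an assumed name, not a table from the original build&lt;br /&gt;
 SELECT m.*,&lt;br /&gt;
  CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
  CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
  -- a NULL warning falls through to the ELSE branch and gets 0&lt;br /&gt;
  CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS m;&lt;br /&gt;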
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
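An alternative to the MIN plus join approach, shown here only as a sketch and not what was used to build the tables, is PostgreSQL's DISTINCT ON. It keeps the first row per portco key in the ORDER BY order, so it returns exactly one row per key even if two matches share the same minimum issuedate:&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 -- the DISTINCT ON columns must lead the ORDER BY; file2issuedate then picks the earliest match&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;&lt;br /&gt;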
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread over the 8 keys below, that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
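To see where the extra rows come from, one option (a sketch, not part of the original process) is to list the portco keys that appear more than once in the Matcher output. Each of these fans out in the join above. Note that (coname, city, year) may also fail to be unique on the companybasecore side, which this query does not test:&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;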
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to do this. When you copy the addresses out of the database be sure to include a distinct key that will allow you to join the geo data back with the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
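The CASE expression above is equivalent to a COALESCE that tries issuedate first, then announceddate, then datelastinv + 5 years. As a hedged cross-check that is not part of the original script, the query below counts the rows where the two disagree and should return 0:&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 -- COALESCE returns its first non-NULL argument, mirroring the order of the WHEN branches&lt;br /&gt;
 WHERE deaddate IS DISTINCT FROM COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR');&lt;br /&gt;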
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
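The tempbase query above relies on an allyears table that has at least a year column. If you need to rebuild it, a sketch along these lines should work. The 1980 to 2017 range is an assumption, so adjust it to cover the years in your data:&lt;br /&gt;
 -- the year bounds are placeholders, not values taken from the original database&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;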
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
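As with the other core tables, it is worth confirming that the key is now unique. A minimal sketch of that check is below. The two counts should match, provided rounddate is never NULL (rows with a NULL rounddate are not caught by the roundexclude comparison):&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;
&lt;br /&gt;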
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks rounds that financed activities we are not interested in, so that they can be left out of our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
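The CASE lists above are hard-coded, so a stage3 value that is missing from all of them ends up with every flag set to 0. The query below, a sketch added as a check, lists any such values so you can decide which flag they belong to:&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 -- growthflag already covers the seed, early and later flags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;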
==Cleaning firmbase==&lt;br /&gt;
The key I chose after trying a few different combinations is firmname, statecode, addr1.&lt;br /&gt;
 SELECT COUNT(*) FROM firmbase;&lt;br /&gt;
 --14567&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT firmname, statecode, addr1 FROM firmbase)a;&lt;br /&gt;
 --14250&lt;br /&gt;
We'll need to investigate the 317 extra rows that share a key with another row.&lt;br /&gt;
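A query like the following, sketched here using the key chosen above, lists the duplicated firmbase keys so they can be inspected by hand:&lt;br /&gt;
 SELECT firmname, statecode, addr1, COUNT(*)&lt;br /&gt;
 FROM firmbase&lt;br /&gt;
 GROUP BY firmname, statecode, addr1&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;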
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19559</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19559"/>
		<updated>2017-07-27T19:38:00Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning firmbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line numbers copied into your new tables match the line numbers in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
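The same check can be sketched as a single query. This is an illustrative alternative, not the query used to build vcdb2; it relies on PostgreSQL accepting a row value inside COUNT(DISTINCT ...). The key is unique only when the two result columns are equal.&lt;br /&gt;
 --totalrows and distinctkeys are illustrative column names&lt;br /&gt;
 SELECT COUNT(*) AS totalrows,&lt;br /&gt;
  COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybase;&lt;br /&gt;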
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
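Alternatively, as a sketch, the duplicate rows can be inspected directly in psql rather than exported. The column list here is only a suggestion; widen it until a distinguishing field shows up.&lt;br /&gt;
 --show the rows that share the duplicated key&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, url, addr1, city&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;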
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and undisclosed company that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between a (coname, statecode, datefirstinv) key and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many duplicate keys have different principalamt values. For each key, let's keep the MAX principalamt and throw out the rows with the same key and a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
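For comparison, PostgreSQL's DISTINCT ON can produce one row per key in a single step. This is a different technique from the MAX/JOIN/DELETE process above and is shown only as a sketch. The table name ipocorealt is hypothetical. Unlike the manual deletes, it keeps one arbitrary row when several rows for a key tie on principalamt instead of dropping the key, so its row count will differ from ipocore.&lt;br /&gt;
 --hypothetical alternative table, not part of vcdb2&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;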
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 --add the id to mas first so that the copy mas1 carries identical id values&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the mascore table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but by no more than 1 week, the dateflag is 0.    &lt;br /&gt;
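As a small worked example of the flag logic, the query below applies the same CASE expression to four made-up dates that share a minimum announceddate of 2010-01-01. The dates are illustrative only and are not taken from mas.&lt;br /&gt;
 --illustrative dates only&lt;br /&gt;
 SELECT announceddate, minanndate,&lt;br /&gt;
  CASE WHEN announceddate - INTERVAL '7 day' &amp;gt; minanndate OR announceddate = minanndate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-01', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-05', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-08', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-09', DATE '2010-01-01')&lt;br /&gt;
 ) AS t(announceddate, minanndate)&lt;br /&gt;
 ORDER BY announceddate;&lt;br /&gt;
 --dateflag by row: 1, 0, 0, 1&lt;br /&gt;
The 2010-01-08 row is exactly 7 days after the minimum, so it is still flagged 0; only dates more than 7 days later are kept.&lt;br /&gt;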
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate &amp;gt; datefirstinv for a merger/acquisition and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see, we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should equal the amount in the matcherwarningmindates table but it doesn't. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
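A note on the masterdate CASE expression above: in PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so LEAST(announceddate, ipoissuedate) on its own should give the same masterdate. The sketch below uses made-up dates and illustrative column names to show that behavior.&lt;br /&gt;
 --illustrative dates only&lt;br /&gt;
 SELECT LEAST(DATE '2012-03-01', NULL::date) AS masonly,&lt;br /&gt;
  LEAST(NULL::date, DATE '2011-06-15') AS ipoonly,&lt;br /&gt;
  LEAST(DATE '2012-03-01', DATE '2011-06-15') AS masandipo,&lt;br /&gt;
  LEAST(NULL::date, NULL::date) AS neither;&lt;br /&gt;
 --masonly 2012-03-01, ipoonly 2011-06-15, masandipo 2011-06-15, neither NULL&lt;br /&gt;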
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in your matcherportcomascore and matcherportcoipocore tables. Each portcokey that matched both an ipokey and a maskey keeps only one of them, so that many keys drop out. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
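One more check that could be added here, as a sketch: confirm that the joins did not duplicate any portco key. The query should return zero rows if every (coname, statecode, datefirstinv) key appears exactly once in the master table.&lt;br /&gt;
 --any row returned is a portco key that was duplicated by the joins&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;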
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next, export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice some errors in the matched output file. In Excel, add two exclude flags: one that is 1 where issuedate &amp;lt; datefirstinv and one that is 1 where issuedate = datefirstinv. If either exclude flag is 1, exclude the row from your table, since a valid match requires issuedate &amp;gt; datefirstinv. If both flags are 0, keep the row. Also add a warning flag column that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then load this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
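The Excel flags described above could instead be computed inside the database. The following is only a sketch under stated assumptions: demo_matcherraw and its rows are hypothetical, and which comparison goes into excludeflag1 versus excludeflag2 is a guess, since the page does not say.&lt;br /&gt;

```sql
-- Hypothetical stand-in for the raw Matcher output before any flags are added.
CREATE TEMP TABLE demo_matcherraw (
  warning varchar(100),
  file1coname varchar(100),
  file1datefirstinv date,
  file2issuer varchar(100),
  file2issuedate date
);
INSERT INTO demo_matcherraw VALUES
  (NULL, 'Alpha Inc', '2001-03-15', 'Alpha Inc', '2005-06-01'),
  (NULL, 'Beta LLC', '2010-07-01', 'Beta LLC', '2009-01-20'),
  ('Hall-Warning:Multiple', 'Gamma Corp', '1998-02-02', 'Gamma Corp', '1998-02-02');

CREATE TEMP TABLE demo_matcherflags AS
SELECT *,
  -- Assumed meaning: the IPO is dated earlier than the first investment.
  CASE WHEN file1datefirstinv > file2issuedate THEN 1 ELSE 0 END AS excludeflag1,
  -- Assumed meaning: the IPO is dated the same day as the first investment.
  CASE WHEN file2issuedate = file1datefirstinv THEN 1 ELSE 0 END AS excludeflag2,
  -- A NULL warning falls through to ELSE and gets 0.
  CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END AS warningflag
FROM demo_matcherraw;

SELECT file1coname, excludeflag1, excludeflag2, warningflag
FROM demo_matcherflags
ORDER BY file1coname;
```

Computing the flags in SQL keeps the rule next to the data, at the cost of losing the chance to eyeball each row in Excel.&lt;br /&gt;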
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
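A possible single-step alternative to the mindate and include tables is PostgreSQL's DISTINCT ON, which keeps one row per key according to the ORDER BY. This is a sketch, not the method used to build vcdb2: demo_multimatch and its rows are made up. One difference to be aware of is that the join above keeps every row tied on the minimum issuedate, while DISTINCT ON always keeps exactly one row per key.&lt;br /&gt;

```sql
-- Hypothetical rows standing in for Matcher output with multiple matches per portco key.
CREATE TEMP TABLE demo_multimatch (
  file1coname varchar(100),
  file1statecode varchar(2),
  file1datefirstinv date,
  file2issuer varchar(100),
  file2issuedate date
);
INSERT INTO demo_multimatch VALUES
  ('Alpha Inc', 'TX', '2001-03-15', 'Alpha Inc',          '2007-09-10'),
  ('Alpha Inc', 'TX', '2001-03-15', 'Alpha Incorporated', '2005-06-01'),
  ('Beta LLC',  'NY', '2010-07-01', 'Beta LLC',           '2014-04-04'),
  ('Beta LLC',  'NY', '2010-07-01', 'Beta Ltd',           '2016-11-30');

-- Keep, for each portco key, the matched row with the earliest issuedate.
CREATE TEMP TABLE demo_include AS
SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *
FROM demo_multimatch
ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;

SELECT * FROM demo_include;
```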
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and ready for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
The table still has 11 more rows than distinct keys (43,662 rows versus 43,651 keys), spread across the 8 duplicated keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the latitude and longitude of each of these rows and delete one row for each duplicated key. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table, as shown below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables. Then calculate the deaddate: use the ipo issuedate if there is one, otherwise the mas announceddate; if there is no exit, the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
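The CASE expression above picks the first available date in a fixed order, which is what COALESCE does. The sketch below shows that equivalence on a hypothetical table (demo_deaddate and its rows are invented). The cast to date is an addition made here so that all three branches have the same type; the CASE above does not cast.&lt;br /&gt;

```sql
-- Hypothetical stand-in for the deaddate table; the rows are invented.
CREATE TEMP TABLE demo_deaddate (
  coname varchar(100),
  datelastinv date,
  issuedate date,
  announceddate date
);
INSERT INTO demo_deaddate VALUES
  ('Alpha Inc',  '2004-01-10', '2005-06-01', NULL),          -- IPO exit
  ('Beta LLC',   '2012-03-03', NULL,         '2014-04-04'),  -- acquisition exit
  ('Gamma Corp', '2000-05-05', NULL,         NULL);          -- no exit

-- IPO date first, then acquisition date, then datelastinv + 5 years.
CREATE TEMP TABLE demo_deaddate1 AS
SELECT *,
       COALESCE(issuedate,
                announceddate,
                (datelastinv + INTERVAL '5 years')::date) AS deaddate
FROM demo_deaddate;

SELECT coname, deaddate FROM demo_deaddate1 ORDER BY coname;
```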
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
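The allyears table joined above is not defined in this section. One way such a lookup table could be built is with generate_series; this is an assumption, not necessarily how the original allyears was created, and the year range and demo rows below are illustrative only.&lt;br /&gt;

```sql
-- Assumed definition of a year lookup table; the range is illustrative only.
CREATE TEMP TABLE demo_allyears AS
SELECT generate_series(2000, 2005) AS year;

-- Hypothetical stand-in for deadalive1; the rows are invented.
CREATE TEMP TABLE demo_deadalive1 (
  coname varchar(100),
  city varchar(100),
  statecode varchar(2),
  aliveyear int,
  deadyear int
);
INSERT INTO demo_deadalive1 VALUES
  ('Alpha Inc',  'Austin',  'TX', 2001, 2003),
  ('Beta LLC',   'Austin',  'TX', 2003, 2010),
  ('Gamma Corp', 'Houston', 'TX', 1998, 2000);

-- A company counts as alive in every year from aliveyear through deadyear.
CREATE TEMP TABLE demo_alivecount AS
SELECT city, statecode, year, COUNT(coname) AS numalive
FROM demo_allyears
JOIN demo_deadalive1 ON year BETWEEN aliveyear AND deadyear
GROUP BY city, statecode, year;

SELECT * FROM demo_alivecount ORDER BY city, year;
```

Years outside the range of the lookup table are silently dropped, so the range should cover every datefirstinv and deaddate year that needs to be counted.&lt;br /&gt;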
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
The roundplus table is used to build the two round-level output tables below.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag is used to exclude companies that received financing for activities we are not interested in; entries like 'Open Market Purchase' and 'PIPE' are what the exclude flag filters out. The stageflags table is built from the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
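The same flags could be written more compactly with IN lists and PostgreSQL's boolean-to-integer cast. This is a sketch of an alternative formulation, not the query used to build vcdb2; demo_round and its rows are hypothetical, and only three of the flags are shown. COALESCE turns a NULL stage3 into 0 so the result matches the ELSE 0 branches above.&lt;br /&gt;

```sql
-- Hypothetical stand-in for the round table; the rows are invented.
CREATE TEMP TABLE demo_round (
  coname varchar(100),
  stage3 varchar(50)
);
INSERT INTO demo_round VALUES
  ('Alpha Inc',  'Seed'),
  ('Beta LLC',   'Later Stage'),
  ('Gamma Corp', 'PIPE'),
  ('Delta Co',   NULL);

CREATE TEMP TABLE demo_stageflags AS
SELECT coname, stage3,
  -- A boolean cast to int gives 1 for true and 0 for false; NULL stays NULL until COALESCE.
  COALESCE((stage3 = 'Seed')::int, 0) AS seedflag,
  COALESCE((stage3 IN ('Seed', 'Early Stage', 'Later Stage'))::int, 0) AS growthflag,
  COALESCE((stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout',
                       'Other', 'VC Partnership', 'Secondary Purchase'))::int, 0) AS excludeflag
FROM demo_round;

SELECT * FROM demo_stageflags ORDER BY coname;
```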
==Cleaning firmbase==&lt;br /&gt;
The key I chose after trying a few different combinations is firmname, statecode, addr1.&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19558</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19558"/>
		<updated>2017-07-27T16:48:50Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Cleaning fundbase */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in textpad for something unique about one of the entries for New York Digital Health LLC that we can use to single it out for manual deletion. It turns out the url is different, so we'll use that.&lt;br /&gt;
Delete this record from the roundbase table (upstream of companybase1) using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
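Instead of copying the whole table out to a text editor, the full rows behind each duplicated key can be listed directly by joining the table back to its duplicated keys. The sketch below assumes a hypothetical demo_companybase1 with invented rows and only a url column to tell the duplicates apart.&lt;br /&gt;

```sql
-- Hypothetical stand-in for companybase1; the rows and urls are invented.
CREATE TEMP TABLE demo_companybase1 (
  coname varchar(255),
  statecode varchar(2),
  datefirstinv date,
  url varchar(100)
);
INSERT INTO demo_companybase1 VALUES
  ('Alpha Inc', 'TX', '2001-03-15', 'www.example.com/alpha'),
  ('Beta LLC',  'NY', '2010-07-01', 'www.example.com/beta-old'),
  ('Beta LLC',  'NY', '2010-07-01', 'www.example.com/beta-new');

-- Return every column of the rows whose key appears more than once.
CREATE TEMP TABLE demo_duprows AS
SELECT c.*
FROM demo_companybase1 AS c
JOIN (
  SELECT coname, statecode, datefirstinv
  FROM demo_companybase1
  GROUP BY coname, statecode, datefirstinv
  HAVING COUNT(*) > 1
) AS dup USING (coname, statecode, datefirstinv);

SELECT * FROM demo_duprows ORDER BY coname, url;
```

Scanning the returned rows side by side shows which column differs and can be used in the WHERE clause of the DELETE.&lt;br /&gt;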
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in textpad/excel for ways to filter these keys. Excluding every entry with a duplicated key would drop 1888 rows from the ipos table, so to save as much information as possible we look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
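The full rows behind each duplicate key can also be pulled inside the database rather than in textpad/excel. A minimal sketch using only the ipos and ipoduplicates tables built above; rows whose statecode is NULL will not join and would need a separate look.&lt;br /&gt;
 --list every ipos row whose key appears more than once&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoduplicates AS d ON ipos.issuer = d.issuer AND ipos.issuedate = d.issuedate AND ipos.statecode = d.statecode&lt;br /&gt;
 ORDER BY ipos.issuer, ipos.issuedate;&lt;br /&gt;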
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
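The hand-written DELETE statements above could also be expressed as one statement driven by the ipoduplicates2 table. This is a sketch, not the command that was originally run. Like the statements above it matches on issuer and statecode only, so it removes every ipoinclude row for an affected issuer.&lt;br /&gt;
 --remove all ipoinclude rows for issuers that still have duplicate keys&lt;br /&gt;
 DELETE FROM ipoinclude&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE ipoinclude.issuer = d.issuer AND ipoinclude.statecode = d.statecode;&lt;br /&gt;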
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I added an id column to mas and took the MIN id for each duplicate key.&lt;br /&gt;
 --add the id to mas first so that the mas1 copy carries exactly the same ids&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
If the distinct key count matches the total row count of the mascore table (114825), the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher on them will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week later than the minimum announceddate for its targetname, targetstatecode pair, or when it is that minimum announceddate itself. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.&lt;br /&gt;
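To see how the flag behaves at the boundary, the same CASE expression can be run against a few made-up dates. The dates and the minimum of 2010-01-01 are invented for illustration and do not come from the mas data.&lt;br /&gt;
 --illustrative dates only; 2010-01-01 plays the role of the minimum announceddate&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS t(d)&lt;br /&gt;
 ORDER BY d;&lt;br /&gt;
 --dateflag is 1, 0, 0, 1: exactly 7 days after the minimum is still dropped&lt;br /&gt;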
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
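Since the flags are built by hand in Excel, it may be worth cross-checking them after the import. The sketch below recomputes each flag from the rule described above and counts the rows where the imported value disagrees. It should return 0 if the flags were built exactly as described.&lt;br /&gt;
 --count rows whose Excel flags differ from the flags recomputed in SQL&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 IS DISTINCT FROM (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag IS DISTINCT FROM (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;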
You've imported 9,645 matches into your matcher table in vcdb2, but only some of them are &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate of the merger/acquisition falls after the datefirstinv (so it is neither earlier than nor equal to it). The query below counts them.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366 because some portco keys have more than one match on their minimum announceddate. To find these entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
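The conflicting matches behind these counts can also be listed directly in psql, which shows which target each duplicate row points at. A sketch using only the matcherportcomasinclude table built above:&lt;br /&gt;
 --show every include row whose portco key appears more than once&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude&lt;br /&gt;
 WHERE (file1coname, file1statecode, file1datefirstinv) IN&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv HAVING COUNT(*) &amp;gt; 1)&lt;br /&gt;
 ORDER BY file1coname, file2targetname;&lt;br /&gt;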
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of rows in your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
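The same comparison can be done in one query so the two totals appear side by side. This is a sketch that only uses tables created in the steps above; the column names keytotal and matchertotal are made up for the example.&lt;br /&gt;
 --keytotal and matchertotal should be equal (11005 with the counts above)&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keytotal,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcoipocore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM matcherportcomascore) AS matchertotal;&lt;br /&gt;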
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to a text file and put in the Matcher input folder. Run the matcher on these files. For instructions on how to use the Matcher check this out [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add one exclude flag that is 1 where the issuedate &amp;lt; datefirstinv and a second that is 1 where the issuedate = datefirstinv. A row with either exclude flag set to 1 should be left out of your table; a row with both flags at 0 has issuedate &amp;gt; datefirstinv and should be kept. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 rows against 43651 distinct keys), spread over the 8 keys listed below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
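Before deleting anything, it helps to see the full rows behind the remaining duplicate keys so the coordinates can be compared side by side. A query along these lines should do it; this is only a sketch, and the alias dupkeys is made up for illustration.&lt;br /&gt;
 --show every column for keys that appear more than once in geo1&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT coname, city, startyear&lt;br /&gt;
  FROM geo1&lt;br /&gt;
  GROUP BY coname, city, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1) AS dupkeys&lt;br /&gt;
 ON g.coname = dupkeys.coname AND g.city = dupkeys.city AND g.startyear = dupkeys.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;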
Now manually check the longitude and latitude of each of these rows and delete one row for each duplicated key (note that the geo tables spell the column lattitude). Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through the [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv from the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import the matcher output into vcdb2 and join companybasecore with geocore through it, the row count inflates because some portco keys match more than once. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
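To see where the extra rows come from, one option is to look for portco keys that the matcher paired with more than one geo key. A sketch, assuming the matcherportcogeo table loaded above; nummatches is an illustrative alias.&lt;br /&gt;
 --portco keys that appear more than once in the matcher output&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;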
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
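The CASE expression gives priority to the IPO issue date, then the acquisition announcement date, then datelastinv + 5 years. The same priority order can be written with COALESCE, which makes for a quick sanity check. This is only a sketch; it should return 0 if the two forms agree on every row.&lt;br /&gt;
 --count rows where deaddate differs from the COALESCE form&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 WHERE deaddate IS DISTINCT FROM COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR');&lt;br /&gt;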
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
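The tempbase query relies on an allyears table, which is not built in this section and is assumed here to hold one row per year in a column named year. If it is missing, a minimal sketch for creating it follows; the 1980 to 2017 range is an assumption and should be adjusted to cover the datefirstinv and deaddate years in your data.&lt;br /&gt;
 --one row per calendar year (range is illustrative)&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT y AS year&lt;br /&gt;
 FROM generate_series(1980, 2017) AS y;&lt;br /&gt;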
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude all companies that received financing for activities we are not interested in and thus should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
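Two quick checks can catch mistakes in the flag logic. This is a sketch under the assumption that every stage3 value should fall into one of the categories above: the first query should return 0, and the second lists any stage3 values that no flag covers.&lt;br /&gt;
 --growthflag should equal the sum of the three stage flags&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag != seedflag + earlyflag + laterflag;&lt;br /&gt;
&lt;br /&gt;
 --stage3 values that are not covered by any flag&lt;br /&gt;
 SELECT stage3, COUNT(*)&lt;br /&gt;
 FROM stageflags&lt;br /&gt;
 WHERE growthflag = 0 AND transactionflag = 0 AND excludeflag = 0&lt;br /&gt;
 GROUP BY stage3;&lt;br /&gt;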
==Cleaning firmbase==&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;br /&gt;
==Cleaning roundline==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19556</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19556"/>
		<updated>2017-07-27T16:27:45Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Creating Stage Flags Table */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Look up addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the line counts given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table, not just from companybase1, using the command below so that the fix is applied upstream. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
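Before running a manual DELETE like the one above, it can help to preview the candidate rows. A sketch using only the key columns and url from roundbase:&lt;br /&gt;
 --list the distinct urls recorded for the duplicated key&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, url&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;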
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match (114890 rows but only 114825 distinct keys), so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the row with the MIN id for each duplicate key.&lt;br /&gt;
 --Add the id to mas first so that the copy below carries the same ids&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table (114825), so the mascore table is clean.&lt;br /&gt;
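If you want to look at the duplicated keys in the original mas table before relying on the MIN id rule, a query along these lines lists each repeated key with its row count. This is only a sketch that reuses the key columns from above; numrows and the ordering are my own choices.&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, COUNT(*) AS numrows&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY numrows DESC, targetname;&lt;br /&gt;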
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher on them will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week of the earliest announceddate for the same targetname, targetstatecode pair. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 7 days after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but no more than 7 days later, the dateflag is 0 and the key is dropped in the next step.&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
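To review what the date filter discarded, you can list the keys whose dateflag is 0 together with how far they fall after the minimum date. This is a sketch, assuming announceddate is stored as a date so that subtracting two dates gives a whole number of days (dayslater is a name I made up). With the counts on this page it should return 114825 - 114794 = 31 rows.&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, minanndate,&lt;br /&gt;
 announceddate - minanndate AS dayslater&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 0&lt;br /&gt;
 ORDER BY targetname, announceddate;&lt;br /&gt;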
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still get many warnings in the output.matched file. In Excel, add one exclude flag for rows where the announceddate &amp;lt; datefirstinv and a second exclude flag for rows where the datefirstinv = announceddate. Also add a warning flag that is 1 when the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
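Since the flags are added by hand in Excel, it can be worth double-checking them once the file is imported. The sketch below counts rows whose flags disagree with the rules described above. It assumes excludeflag1 marks an announceddate before the datefirstinv and excludeflag2 marks equal dates; swap them if your spreadsheet uses the other order. badflag1 and badflag2 are names I made up, and both should be 0 if the flags were built that way.&lt;br /&gt;
 SELECT&lt;br /&gt;
 SUM(CASE WHEN excludeflag1 &amp;lt;&amp;gt; (file2announceddate &amp;lt; file1datefirstinv)::int THEN 1 ELSE 0 END) AS badflag1,&lt;br /&gt;
 SUM(CASE WHEN excludeflag2 &amp;lt;&amp;gt; (file2announceddate = file1datefirstinv)::int THEN 1 ELSE 0 END) AS badflag2&lt;br /&gt;
 FROM matcherportcomas;&lt;br /&gt;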
You've imported 9,645 matches into your matcher table in vcdb2. The query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the merger/acquisition was announced after the first investment, i.e. announceddate &amp;gt; datefirstinv (neither before it nor on the same day).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the matches with a multiple-match warning as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. Each portco key from the companybase table should match exactly 1 mas key from the mascore table. If any portco key matches more than one mas key you will get duplicate rows in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables. Each key should match exactly one row in its own table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because the maskeys and ipokeys can collide. Some companies will have an ipo as well as a merger/acquisition, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will increase our row counts every time we join on these duplicate keys. We want each portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open the file in Excel and add a flag to find the rows that have an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earlier date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Finally we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the sum to the combined number of rows in your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 11005 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
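The same check can be done in one statement so the arithmetic does not have to be done by hand. This is a sketch using the tables built above; keysidetotal and matchersidetotal are names I made up. The two numbers should be equal (11005 with the counts on this page) as long as no portco has an ipo and an ma on the same date, since a tie would set both valid flags to 1.&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
    WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keysidetotal,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchersidetotal;&lt;br /&gt;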
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the date and amount of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], only the exit deal with the minimum date is kept for each company, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
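One more check that fits here: because only the earliest exit is kept, no company in the master table should carry both an ipo date and an ma date. A sketch of that check is below. It should return 0 unless a company has an ipo and an ma on exactly the same date.&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND masannounceddate IS NOT NULL;&lt;br /&gt;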
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If an exclude flag is 1 then you want to exclude the row from your table, because the ipo did not come after the first investment. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 vs 43651), spread over the 8 duplicated keys shown below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
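To see the full rows behind these remaining duplicate keys, including their coordinates, you can join geo1 back to the list of duplicated keys. This is a sketch; dups is just an alias I chose.&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 INNER JOIN (SELECT city, coname, startyear&lt;br /&gt;
             FROM geo1&lt;br /&gt;
             GROUP BY city, coname, startyear&lt;br /&gt;
             HAVING COUNT(*) &amp;gt; 1) AS dups&lt;br /&gt;
 ON g.city = dups.city AND g.coname = dups.coname AND g.startyear = dups.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;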
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)::int AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import the output into vcdb2 and join companybasecore with geocore through it, your row counts will start to explode. Notice how the row count jumps from 44,740 to 45,018 in the join below.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
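To find where the extra 278 rows (45018 - 44740) come from, you can count how many geo keys each portco key is matched to in the matcher output. This is a sketch; nummatches is a name I made up. Keys with a count above 1 are the ones that multiply rows in the join, and rows repeated in the matcher file itself will show up here too.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;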
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
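A quick way to catch the fan-out seen above (45,018 joined rows against 44,740 companies) is to compare the row count before and after the LEFT JOIN. The sketch below is illustrative only: it uses Python's sqlite3 module with an in-memory database instead of PostgreSQL, and the function name join_row_counts and the table names base and lookup are hypothetical.&lt;br /&gt;

```python
import sqlite3


def join_row_counts(base_rows, lookup_rows):
    """Return (base_count, joined_count) for a LEFT JOIN on the three-part key.

    base_rows:   (coname, statecode, datefirstinv)
    lookup_rows: (coname, statecode, datefirstinv, latitude, longitude)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE base (coname TEXT, statecode TEXT, datefirstinv TEXT)"
    )
    con.execute(
        "CREATE TABLE lookup (coname TEXT, statecode TEXT, datefirstinv TEXT, "
        "latitude REAL, longitude REAL)"
    )
    con.executemany("INSERT INTO base VALUES (?, ?, ?)", base_rows)
    con.executemany("INSERT INTO lookup VALUES (?, ?, ?, ?, ?)", lookup_rows)
    base_count = con.execute("SELECT COUNT(*) FROM base").fetchone()[0]
    # If the lookup key is unique, a LEFT JOIN cannot add rows to the base table.
    joined_count = con.execute(
        "SELECT COUNT(*) FROM base AS c LEFT JOIN lookup AS g "
        "ON c.coname = g.coname AND c.statecode = g.statecode "
        "AND c.datefirstinv = g.datefirstinv"
    ).fetchone()[0]
    con.close()
    return base_count, joined_count
```

If the two numbers differ, the lookup table holds more than one row for at least one key and should be cleaned before it is joined.&lt;br /&gt;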
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to do this. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
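The deaddate rule can be tried out on a few rows before running it against the full table. This is a minimal sketch under stated assumptions: it uses Python's sqlite3 module rather than PostgreSQL (so the five-year offset is written with SQLite's date() modifier instead of INTERVAL), the function name dead_dates is hypothetical, and the rows are made up for illustration.&lt;br /&gt;

```python
import sqlite3


def dead_dates(rows):
    """Map coname to deaddate.

    rows: (coname, datelastinv, issuedate, announceddate), ISO dates or None.
    Rule: IPO issue date if present, else the merger/acquisition announced
    date, else datelastinv plus five years.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE deaddate (coname TEXT, datelastinv TEXT, "
        "issuedate TEXT, announceddate TEXT)"
    )
    con.executemany("INSERT INTO deaddate VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        "SELECT coname, "
        "CASE WHEN issuedate IS NOT NULL THEN issuedate "
        "WHEN announceddate IS NOT NULL THEN announceddate "
        "ELSE date(datelastinv, '+5 years') END AS deaddate "
        "FROM deaddate"
    ).fetchall()
    con.close()
    return dict(result)
```

Note that when a company has both an IPO and a merger/acquisition, the IPO date wins, which matches the order of the WHEN branches in the query above.&lt;br /&gt;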
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
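The tempbase and alivecount steps expand each company into one row per year it was alive and then count companies per city, state and year. The sketch below illustrates that idea with Python's sqlite3 module instead of PostgreSQL. The allyears table is not defined in this section, so the sketch assumes it is a one-column table holding every calendar year of interest; the function name alive_counts and the sample rows are hypothetical.&lt;br /&gt;

```python
import sqlite3


def alive_counts(companies, first_year, last_year):
    """Return {(city, statecode, year): numalive}.

    companies: (coname, city, statecode, aliveyear, deadyear)
    """
    con = sqlite3.connect(":memory:")
    # Assumed shape of allyears: one row per calendar year.
    con.execute("CREATE TABLE allyears (year INTEGER)")
    con.executemany(
        "INSERT INTO allyears VALUES (?)",
        [(y,) for y in range(first_year, last_year + 1)],
    )
    con.execute(
        "CREATE TABLE deadalive1 (coname TEXT, city TEXT, statecode TEXT, "
        "aliveyear INTEGER, deadyear INTEGER)"
    )
    con.executemany("INSERT INTO deadalive1 VALUES (?, ?, ?, ?, ?)", companies)
    # One row per company per year alive, then a count per city/state/year.
    rows = con.execute(
        "SELECT city, statecode, year, COUNT(coname) FROM ("
        " SELECT DISTINCT year, coname, city, statecode FROM allyears"
        " JOIN deadalive1 ON year BETWEEN aliveyear AND deadyear"
        ") GROUP BY city, statecode, year"
    ).fetchall()
    con.close()
    return {(city, state, year): n for city, state, year, n in rows}
```

Years in which no company was alive in a city simply do not appear in the result, which is why the later join to alivecount is a LEFT JOIN.&lt;br /&gt;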
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
The two round-level output tables below are built from roundplus.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
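The exclusion approach drops every row whose key occurs more than once, rather than keeping one copy of each duplicate. A small sketch of that behaviour follows, assuming Python's sqlite3 module as a stand-in for PostgreSQL; the function name clean_rounds and the table name round_tbl (used here in place of round) are hypothetical, and the rows are made up.&lt;br /&gt;

```python
import sqlite3


def clean_rounds(rounds):
    """Keep only rows whose (coname, rounddate) key occurs exactly once.

    rounds: (coname, rounddate, stage3)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE round_tbl (coname TEXT, rounddate TEXT, stage3 TEXT)"
    )
    con.executemany("INSERT INTO round_tbl VALUES (?, ?, ?)", rounds)
    # Keys that appear more than once.
    con.execute(
        "CREATE TABLE roundexclude AS "
        "SELECT coname, rounddate FROM round_tbl "
        "GROUP BY coname, rounddate HAVING COUNT(*) > 1"
    )
    # Anti-join: keep rows that have no match in roundexclude.
    kept = con.execute(
        "SELECT coname, rounddate, stage3 FROM round_tbl WHERE NOT EXISTS ("
        " SELECT 1 FROM roundexclude AS re"
        " WHERE re.coname = round_tbl.coname"
        " AND re.rounddate = round_tbl.rounddate"
        ") ORDER BY coname, rounddate"
    ).fetchall()
    con.close()
    return kept
```

The trade-off is that both copies of a duplicated key are lost, which is acceptable here because only 154 keys are affected.&lt;br /&gt;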
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks rounds that financed activities we are not interested in, so that those companies can be excluded from our dataset; entries like 'Open Market Purchase', 'PIPE', etc. are what the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;br /&gt;
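The same mapping from stage3 to flags can be written as a small lookup, which is handy for spot-checking individual values. This is only a sketch: the function name stage_flags and the set names are hypothetical, while the stage3 strings are copied from the CASE statements above.&lt;br /&gt;

```python
GROWTH_STAGES = {"Seed", "Early Stage", "Later Stage"}
TRANSACTION_STAGES = {
    "Acq. for Expansion", "Acquisition", "Bridge Loan", "Expansion",
    "Pending Acq", "Recap or Turnaround", "Mezzanine",
}
EXCLUDE_STAGES = {
    "LBO", "MBO", "Open Market Purchase", "PIPE", "Secondary Buyout",
    "Other", "VC Partnership", "Secondary Purchase",
}


def stage_flags(stage3):
    """Return the six 0/1 flags for one stage3 value.

    A missing value (None) maps to all zeros, like the ELSE 0 branches in SQL.
    """
    return {
        "seedflag": int(stage3 == "Seed"),
        "earlyflag": int(stage3 == "Early Stage"),
        "laterflag": int(stage3 == "Later Stage"),
        "growthflag": int(stage3 in GROWTH_STAGES),
        "transactionflag": int(stage3 in TRANSACTION_STAGES),
        "excludeflag": int(stage3 in EXCLUDE_STAGES),
    }
```

For example, stage_flags("PIPE") sets only excludeflag, and stage_flags("Seed") sets both seedflag and growthflag.&lt;br /&gt;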
==Cleaning firmbase==&lt;br /&gt;
&lt;br /&gt;
==Cleaning fundbase==&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19553</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19553"/>
		<updated>2017-07-27T15:56:06Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Gathering geo data from company addresses */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of rows copied into each new table matches the line count noted in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so a single key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
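The key check used throughout this page (total rows versus distinct keys, plus a listing of the offending keys) can be wrapped in one helper. The sketch below is an illustration under assumptions: it runs on Python's sqlite3 module rather than PostgreSQL, the function name key_counts is hypothetical, and the company rows are invented.&lt;br /&gt;

```python
import sqlite3


def key_counts(rows):
    """Return (total_rows, distinct_keys, duplicated_keys).

    rows: (coname, statecode, datefirstinv, url)
    The key is (coname, statecode, datefirstinv).
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE companybase (coname TEXT, statecode TEXT, "
        "datefirstinv TEXT, url TEXT)"
    )
    con.executemany("INSERT INTO companybase VALUES (?, ?, ?, ?)", rows)
    total = con.execute("SELECT COUNT(*) FROM companybase").fetchone()[0]
    distinct = con.execute(
        "SELECT COUNT(*) FROM ("
        " SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)"
    ).fetchone()[0]
    # Keys that map to more than one row.
    duplicated = con.execute(
        "SELECT coname, statecode, datefirstinv FROM companybase "
        "GROUP BY coname, statecode, datefirstinv HAVING COUNT(*) > 1"
    ).fetchall()
    con.close()
    return total, distinct, duplicated
```

A table qualifies as a core table when the first two numbers are equal and the list of duplicated keys is empty.&lt;br /&gt;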
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
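Keeping the MAX principalamt per key removes most duplicates, but keys whose rows tie on principalamt survive the join, which is why a second duplicates table and the manual DELETE statements were needed. The sketch below reproduces that effect on toy data. It assumes Python's sqlite3 module in place of PostgreSQL; the function name dedupe_by_max and the sample rows are hypothetical.&lt;br /&gt;

```python
import sqlite3


def dedupe_by_max(ipos):
    """Return (core_rows, keys_still_duplicated).

    ipos: (issuer, issuedate, statecode, principalamt)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ipos (issuer TEXT, issuedate TEXT, statecode TEXT, "
        "principalamt REAL)"
    )
    con.executemany("INSERT INTO ipos VALUES (?, ?, ?, ?)", ipos)
    # One row per key, carrying the largest principalamt for that key.
    con.execute(
        "CREATE TABLE ipoinclude AS "
        "SELECT issuer, issuedate, statecode, "
        "MAX(principalamt) AS principalamt "
        "FROM ipos GROUP BY issuer, issuedate, statecode"
    )
    # Join back to ipos to recover the full rows that carry that maximum.
    con.execute(
        "CREATE TABLE ipocore AS SELECT ipos.* FROM ipos "
        "INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer "
        "AND ipos.issuedate = ipoinclude.issuedate "
        "AND ipos.statecode = ipoinclude.statecode "
        "AND ipos.principalamt = ipoinclude.principalamt"
    )
    core = con.execute(
        "SELECT issuer, issuedate, statecode, principalamt FROM ipocore "
        "ORDER BY issuer, issuedate"
    ).fetchall()
    ties = con.execute(
        "SELECT issuer, issuedate, statecode FROM ipocore "
        "GROUP BY issuer, issuedate, statecode HAVING COUNT(*) > 1"
    ).fetchall()
    con.close()
    return core, ties
```

Any key returned in the second list still needs a manual decision, for example deleting it from ipoinclude as done above.&lt;br /&gt;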
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against in mas, so I inserted an id column and took the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The mascore distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However running the matcher will still yield many errors. So we will filter the mas keys some more. The first thing is to remove mas keys (targetname, announceddate, targetstatecode) where the announceddate falls within the same week. Keep the key that has the minimum announceddate and discard the higher date. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is that minimum announceddate itself. If the announceddate falls within 1 week after the minimum announceddate for the targetname, targetstatecode pair, the dateflag is 0 and the key is dropped.&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
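The dateflag logic is easy to get backwards, so it helps to run it on a handful of dates. The following sketch assumes Python's sqlite3 module instead of PostgreSQL, so the seven-day window is computed with julianday() rather than INTERVAL; the function name date_flags and the sample keys are hypothetical.&lt;br /&gt;

```python
import sqlite3


def date_flags(keys):
    """Return {(targetname, targetstatecode, announceddate): dateflag}.

    dateflag is 1 for the earliest announceddate of a target and for dates
    more than 7 days after it, and 0 for dates inside the 7-day window.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE maskeys (targetname TEXT, targetstatecode TEXT, "
        "announceddate TEXT)"
    )
    con.executemany("INSERT INTO maskeys VALUES (?, ?, ?)", keys)
    rows = con.execute(
        "SELECT k.targetname, k.targetstatecode, k.announceddate, "
        "CASE WHEN julianday(k.announceddate) - 7 > julianday(m.minanndate) "
        "OR k.announceddate = m.minanndate THEN 1 ELSE 0 END "
        "FROM maskeys AS k JOIN ("
        " SELECT targetname, targetstatecode,"
        " MIN(announceddate) AS minanndate"
        " FROM maskeys GROUP BY targetname, targetstatecode"
        ") AS m ON k.targetname = m.targetname "
        "AND k.targetstatecode = m.targetstatecode"
    ).fetchall()
    con.close()
    return {(name, state, date): flag for name, state, date, flag in rows}
```

Because the comparison is strict, an announceddate exactly 7 days after the minimum is still flagged 0.&lt;br /&gt;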
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
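Because the flags are added by hand in Excel, it may be worth cross-checking them after the import. The sketch below recomputes each flag from its definition above and counts the rows where the imported value disagrees. It assumes the flags are coded 0/1 and that the warning text is stored exactly as 'Hall-Warning:Multiple'. A result of 0 would mean the Excel flags are consistent with their definitions.&lt;br /&gt;
 --sketch of a consistency check; IS DISTINCT FROM also catches NULL flags&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM (CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR excludeflag2 IS DISTINCT FROM (CASE WHEN file2announceddate = file1datefirstinv THEN 1 ELSE 0 END)&lt;br /&gt;
 OR warningflag IS DISTINCT FROM (CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1 ELSE 0 END);&lt;br /&gt;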
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns only the number of &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the announceddate of the merger/acquisition is later than the datefirstinv (so the two dates are also not equal).&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
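The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. This happens when a portco key is matched to two different targets that share the same minimum announceddate. So to find the dirty entries we'll use the query below.&lt;br /&gt;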
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
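As an alternative to the manual deletes, PostgreSQL's DISTINCT ON can keep exactly one row per portco key in a single step. This is only a sketch: the table name matcherportcomasincludealt is hypothetical, and on a tie in announceddate it keeps the alphabetically first targetname, which may not be the row you would have picked by hand.&lt;br /&gt;
 --sketch only: one row per portco key, earliest announceddate first&lt;br /&gt;
 CREATE TABLE matcherportcomasincludealt AS&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcomas&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2announceddate, file2targetname;&lt;br /&gt;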
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas and ipo respectively.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
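In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when all arguments are NULL, so masterdate could also be computed with a single LEAST(announceddate, ipoissuedate) call. A minimal sketch on made-up dates (none of these values come from the database) that covers the four cases:&lt;br /&gt;
 --illustrative values only&lt;br /&gt;
 SELECT announceddate, ipoissuedate, LEAST(announceddate, ipoissuedate) AS masterdate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2005-03-01', DATE '2004-06-15'),&lt;br /&gt;
  (DATE '2005-03-01', NULL::date),&lt;br /&gt;
  (NULL::date, DATE '2004-06-15'),&lt;br /&gt;
  (NULL::date, NULL::date)&lt;br /&gt;
 ) AS t(announceddate, ipoissuedate);&lt;br /&gt;
Each row should give the same masterdate as the CASE expression above: the earlier date when both exist, the only date when one exists, and NULL when neither exists.&lt;br /&gt;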
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined number of keys in your matcherportcoipocore and matcherportcomascore tables. Each of the 83 portcos with both keys had one of them dropped, which is why the 83 is added back. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
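The same comparison can be sketched as one query so the two totals sit side by side. The output column names keptplusoverlap and matched are made up for this example. With the counts reported above, both columns should come out to 11005 (8610 + 2312 + 83 and 8655 + 2350).&lt;br /&gt;
 --sketch: left total is kept keys plus overlap, right total is all matched keys&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
     WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matched;&lt;br /&gt;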
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if it had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
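One more check that may be useful here, sketched below, is to look for portco keys that appear more than once in the master table. If every earlier step left the keys 1:1, this query should return no rows.&lt;br /&gt;
 --sketch: any row returned is a portco key duplicated by one of the joins&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;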
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1 then you will want to exclude the entry from your table, because the issuedate is not later than the datefirstinv. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
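The join above does not repeat the flag filter on the matcherportcoipo side (the mas version of this query does). The check below is a sketch for confirming that this made no difference. It assumes the flags are coded 0/1 as described above. A count of 0 would mean every included row is a non-excluded, warning-flagged match.&lt;br /&gt;
 --sketch: count included rows whose flags are not (0, 0, 1)&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcoipoinclude&lt;br /&gt;
 WHERE excludeflag1 IS DISTINCT FROM 0 OR excludeflag2 IS DISTINCT FROM 0 OR warningflag IS DISTINCT FROM 1;&lt;br /&gt;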
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread over 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through the Matcher (see [[The Matcher (Tool)]]). The geo key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
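The 278 extra rows (45,018 - 44,740) appear when a left-side key finds more than one matching row. Since geocore is unique on its key, a likely place to look is the matcher output itself. The sketch below lists portco-side keys that occur more than once in matcherportcogeo; whether this accounts for all of the extra rows is an assumption to verify against your own data.&lt;br /&gt;
 --sketch: portco keys that are matched more than once&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;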
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to do this. When you copy the addresses out of the database, be sure to include a distinct key that will allow you to join the geo data back to the portco key.&lt;br /&gt;
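A minimal sketch of such an export is shown below. It writes the portco key (coname, statecode, datefirstinv) alongside the address columns from companybasecore. The file name portcoaddresses.txt is made up for this example, and the column layout that Geocode.py expects is not covered here, so adjust the column list as needed.&lt;br /&gt;
 \COPY (SELECT coname, statecode, datefirstinv, addr1, addr2, city, zip FROM companybasecore) TO 'portcoaddresses.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;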
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
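As a small sketch of how the deaddate rule plays out, the query below runs the same CASE expression on three made-up companies (the dates are illustrative only): one with no exit, one with an ipo and one with an ma. The first should get a deaddate five years after datelastinv (2015-06-30), the other two their exit dates. Note that adding an INTERVAL to a date yields a timestamp in PostgreSQL, so the deaddate column comes out as a timestamp.&lt;br /&gt;
 --illustrative values only&lt;br /&gt;
 SELECT datelastinv, issuedate, announceddate,&lt;br /&gt;
 CASE&lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-06-30', NULL::date, NULL::date),&lt;br /&gt;
  (DATE '2010-06-30', DATE '2012-03-15', NULL::date),&lt;br /&gt;
  (DATE '2010-06-30', NULL::date, DATE '2013-09-01')&lt;br /&gt;
 ) AS t(datelastinv, issuedate, announceddate);&lt;br /&gt;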
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
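The allyears table used in the tempbase query below is assumed to already exist with one integer year per row. If it does not, a minimal sketch for building it is shown here. The range 1980 to 2016 is an assumption and should be adjusted so that it covers every aliveyear and deadyear in deadalive1.&lt;br /&gt;
 --sketch: one row per calendar year; the year range is an assumption&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT y AS year&lt;br /&gt;
 FROM generate_series(1980, 2016) AS y;&lt;br /&gt;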
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
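The exclusion pattern above can be tried on toy data. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; vcdb2 itself is a PostgreSQL database) and three made-up rows (hypothetical, not taken from the SDC data). It shows that every row sharing a duplicated (coname, rounddate) key is dropped, not just one of them.&lt;br /&gt;

```python
import sqlite3

# Hypothetical toy rows standing in for the round table.
rows = [
    ("Acme Inc", "2015-01-10", "Seed"),
    ("Acme Inc", "2015-01-10", "Early Stage"),  # same key as the row above
    ("Beta LLC", "2016-03-02", "Later Stage"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE round (coname TEXT, rounddate TEXT, stage3 TEXT)")
conn.executemany("INSERT INTO round VALUES (?, ?, ?)", rows)

# Keys that occur more than once.
conn.execute(
    "CREATE TABLE roundexclude AS "
    "SELECT coname, rounddate FROM round "
    "GROUP BY coname, rounddate HAVING COUNT(*) > 1"
)

# Keep only the rows whose key does not appear in roundexclude.
conn.execute(
    "CREATE TABLE roundcore AS SELECT * FROM round WHERE NOT EXISTS ("
    "SELECT 1 FROM roundexclude AS re "
    "WHERE re.coname = round.coname AND re.rounddate = round.rounddate)"
)

roundcore = conn.execute(
    "SELECT coname, rounddate, stage3 FROM roundcore"
).fetchall()
print(roundcore)
```

Dropping every row of a duplicated key loses some data, but it avoids having to guess which of the conflicting rows is correct.&lt;br /&gt;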
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine whether a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks financing for activities we are not interested in, which should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
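As a quick way to sanity-check the flag rules before running the query below, here is a minimal Python sketch of the same mapping. The stage names are copied from the CASE expressions in that query; the function name stage_flags and the set names are hypothetical helpers, not part of the database.&lt;br /&gt;

```python
# Stage names copied from the CASE expressions in the stageflags query.
GROWTH = {"Seed", "Early Stage", "Later Stage"}
TRANSACTION = {
    "Acq. for Expansion", "Acquisition", "Bridge Loan", "Expansion",
    "Pending Acq", "Recap or Turnaround", "Mezzanine",
}
EXCLUDE = {
    "LBO", "MBO", "Open Market Purchase", "PIPE", "Secondary Buyout",
    "Other", "VC Partnership", "Secondary Purchase",
}


def stage_flags(stage3):
    """Return the six 0/1 flags for one stage3 value (hypothetical helper)."""
    return {
        "seedflag": int(stage3 == "Seed"),
        "earlyflag": int(stage3 == "Early Stage"),
        "laterflag": int(stage3 == "Later Stage"),
        "growthflag": int(stage3 in GROWTH),
        "transactionflag": int(stage3 in TRANSACTION),
        "excludeflag": int(stage3 in EXCLUDE),
    }


print(stage_flags("Early Stage"))
print(stage_flags("PIPE"))
```

A stage3 value that is missing or not in any of the three sets gets 0 for every flag, which matches the ELSE branches of the query.&lt;br /&gt;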
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19545</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19545"/>
		<updated>2017-07-26T21:59:03Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Creating roundplus */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into your new tables matches the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table, because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise, when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
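The same two-count check is repeated for every table on this page, so it can help to see it as one reusable helper. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption); the function name key_counts and the three toy rows are hypothetical.&lt;br /&gt;

```python
import sqlite3


def key_counts(conn, table, key_cols):
    """Return (total rows, distinct keys) for a candidate key (hypothetical helper)."""
    cols = ", ".join(key_cols)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    return total, distinct


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companybase "
    "(coname TEXT, statecode TEXT, datefirstinv TEXT, url TEXT)"
)
conn.executemany("INSERT INTO companybase VALUES (?, ?, ?, ?)", [
    ("Acme Inc", "TX", "2010-01-05", "www.acme.example"),
    ("Acme Inc", "TX", "2010-01-05", "www.acme-alt.example"),  # same key, different url
    ("Beta LLC", "NY", "2012-06-30", "www.beta.example"),
])

total, distinct = key_counts(
    conn, "companybase", ["coname", "statecode", "datefirstinv"]
)
# The key is valid only when the two counts are equal.
print(total, distinct, total == distinct)
```

The table and column names are interpolated directly into the SQL text, so this helper is only suitable for names you typed yourself, not for untrusted input.&lt;br /&gt;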
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
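One plausible reason that duplicates survive the MAX(principalamt) filter is a tie: two rows with the same key and the same principalamt both match the ipoinclude row. The page does not say whether this is what happened in the real data, so treat it as an assumption. The sketch below reproduces the effect with Python's built-in sqlite3 module (a stand-in for PostgreSQL) and four made-up rows.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipos "
    "(issuer TEXT, issuedate TEXT, statecode TEXT, principalamt REAL)"
)
conn.executemany("INSERT INTO ipos VALUES (?, ?, ?, ?)", [
    ("Alpha Corp", "1999-05-01", "CA", 10.0),
    ("Alpha Corp", "1999-05-01", "CA", 25.0),  # same key, larger amount wins
    ("Gamma Fund", "2001-07-15", "NY", 40.0),
    ("Gamma Fund", "2001-07-15", "NY", 40.0),  # tie: MAX cannot separate these
])

# One row per key, carrying the largest principalamt.
conn.execute(
    "CREATE TABLE ipoinclude AS "
    "SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt "
    "FROM ipos GROUP BY issuer, issuedate, statecode"
)

# Join back to ipos on key plus amount, as in the ipocore query.
joined = conn.execute(
    "SELECT ipos.issuer, ipos.principalamt "
    "FROM ipos INNER JOIN ipoinclude "
    "ON ipos.issuer = ipoinclude.issuer "
    "AND ipos.issuedate = ipoinclude.issuedate "
    "AND ipos.statecode = ipoinclude.statecode "
    "AND ipos.principalamt = ipoinclude.principalamt "
    "ORDER BY ipos.issuer"
).fetchall()
print(joined)
```

The Alpha Corp key is reduced to one row, while the tied Gamma Fund key still comes back twice, which is the situation the manual DELETE statements are meant to clear up.&lt;br /&gt;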
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key that has the minimum announceddate and discard the later ones. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or when it is the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum announceddate for the targetname, targetstatecode pair by 1 week or less, the dateflag is 0.    &lt;br /&gt;
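The boundary cases of the dateflag rule are easy to get wrong, so here is a minimal Python sketch of the same comparison used in the CASE expression above. The function name dateflag and the sample dates are hypothetical.&lt;br /&gt;

```python
from datetime import date, timedelta


def dateflag(announced, min_announced):
    """Mirror the CASE expression: 1 for the minimum date itself or for
    a date more than 7 days after it, otherwise 0 (hypothetical helper)."""
    if announced == min_announced:
        return 1
    return int(announced - timedelta(days=7) > min_announced)


first = date(2010, 1, 1)
for d in (date(2010, 1, 1), date(2010, 1, 5), date(2010, 1, 8), date(2010, 1, 9)):
    print(d.isoformat(), dateflag(d, first))
```

With a minimum of 2010-01-01, the dates 2010-01-05 and 2010-01-08 get flag 0 (exactly 7 days later still counts as the same week), while 2010-01-01 and 2010-01-09 get flag 1.&lt;br /&gt;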
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]].&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
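The three flags that were added in Excel can also be computed in code, which makes the rules explicit and repeatable. This is a minimal sketch, assuming the rules are exactly as described above (announceddate before datefirstinv, the two dates equal, and a Warning value of Hall-Warning:Multiple); the function name match_flags and the sample values are hypothetical.&lt;br /&gt;

```python
from datetime import date


def match_flags(warning, datefirstinv, announceddate):
    """Return (excludeflag1, excludeflag2, warningflag) for one matcher row."""
    excludeflag1 = int(datefirstinv > announceddate)   # announced before first investment
    excludeflag2 = int(datefirstinv == announceddate)  # announced on the first investment date
    warningflag = int(warning == "Hall-Warning:Multiple")
    return excludeflag1, excludeflag2, warningflag


# A row is a good match only when all three flags are 0.
print(match_flags("", date(2010, 1, 1), date(2012, 3, 4)))
print(match_flags("Hall-Warning:Multiple", date(2010, 1, 1), date(2009, 6, 1)))
```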
You've imported 9,645 matches into your matcher table in vcdb2, but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is not earlier than the datefirstinv, and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same row count as the matcherwarningmindates table, but it doesn't (366 rows against 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
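A join on the minimum date returns more than one row for a key when two candidate targets share that minimum date. That is presumably what happened for the two keys above, although the page does not show the underlying rows, so treat it as an assumption. The Python sketch below reproduces the effect on made-up matches (all names and dates are hypothetical).&lt;br /&gt;

```python
from collections import defaultdict

# Hypothetical warning-flagged matches:
# (portco key, target name, announceddate as an ISO string).
matches = [
    (("Foo Inc", "TX", "2007-09-25"), "Foo Corp", "2010-02-01"),
    (("Foo Inc", "TX", "2007-09-25"), "Foo Holdings", "2010-02-01"),  # tie on the minimum
    (("Bar LP", "CO", "2004-12-23"), "Bar GP", "2008-06-30"),
    (("Bar LP", "CO", "2004-12-23"), "Bar LLC", "2009-01-15"),
]

# Earliest announceddate per portco key (ISO strings sort chronologically).
min_date = {}
for key, _target, announced in matches:
    if key not in min_date or min_date[key] > announced:
        min_date[key] = announced

# Equivalent of the inner join back on key plus minimum date.
kept = [m for m in matches if m[2] == min_date[m[0]]]

# Keys that still have more than one row after the join.
per_key = defaultdict(int)
for key, _target, _announced in kept:
    per_key[key] += 1
ties = sorted(k for k, n in per_key.items() if n > 1)

print(len(min_date), len(kept), ties)
```

Here two keys produce three joined rows, and the tied key is the one that has to be resolved by hand.&lt;br /&gt;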
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. Each should be 1:1 on itself, meaning each key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
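One case the flags above do not separate is a tie: if announceddate and ipoissuedate fall on the same day, both flags are 1 and the portco key would land in both key tables. A quick check, sketched here using only the table and columns just created, is to count those rows. It should return 0 if there are no ties:&lt;br /&gt;
 --rows where both the maskey and the ipokey were flagged as valid&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;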
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of keys in the matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 11005 = 8655 + 2350.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
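The same comparison can be done in one statement instead of adding the counts by hand. This is a sketch that only uses tables built earlier on this page; the column names keptplusdropped and matched are made up for illustration:&lt;br /&gt;
 --the two columns should be equal if no key was lost or duplicated&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
     WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusdropped,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matched;&lt;br /&gt;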
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, you want to exclude the entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
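If you would rather not build the flags in Excel, roughly the same flags could be computed in SQL. This is a sketch under assumptions: matcherportcoiporaw is a hypothetical table holding the raw Matcher output (the first seven columns above, without flags), and the assignment of excludeflag1 to the &amp;lt; case and excludeflag2 to the = case is my guess, not something stated on this page.&lt;br /&gt;
 --matcherportcoiporaw is hypothetical: raw Matcher output loaded without the three flag columns&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN r.file2issuedate &amp;lt; r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN r.file2issuedate = r.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN r.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;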
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows vs 43651 keys). The query below shows the 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
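To see where the extra rows come from, one likely source is portco keys that appear more than once in the matcher output. A query along these lines (a sketch using only the matcherportcogeo columns defined above) lists them:&lt;br /&gt;
 --each row returned is a portco key that the Matcher paired with more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;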
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to do this.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
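The same rule can be written more compactly with COALESCE, which returns its first non-NULL argument. This is a sketch, assuming issuedate, announceddate and datelastinv are all date columns. Unlike the CASE version, the fallback is cast back to a date so the result column is a date rather than a timestamp:&lt;br /&gt;
 --ipo date first, then ma date, then datelastinv + 5 years when there is no exit&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 COALESCE(issuedate, announceddate, (datelastinv + INTERVAL '5 YEAR')::date) AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;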
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
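The tempbase query above relies on an allyears table that is not built anywhere in this section. If you do not have one, something like the following would create it. This is a sketch: the single integer column year is inferred from how tempbase uses it, and the bounds 1980 and 2017 are placeholders that you should replace with the range covered by your data.&lt;br /&gt;
 --hypothetical: one row per calendar year; 1980 and 2017 are placeholder bounds&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT y AS year&lt;br /&gt;
 FROM generate_series(1980, 2017) AS y;&lt;br /&gt;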
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, c.city, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON c.coname = roundcore.coname AND c.statecode = roundcore.statecode AND c.datefirstinv = &lt;br /&gt;
 roundcore.datefirstinv;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
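As with the other core tables, it is worth confirming that the key is now unique. A sketch of the check, using the same (coname, rounddate) key as above; the two counts should be equal:&lt;br /&gt;
 --total rows in roundcore&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 --distinct (coname, rounddate) keys in roundcore&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;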
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag is used to exclude all companies that received financing for activities we are not interested in and thus should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19544</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19544"/>
		<updated>2017-07-26T21:58:31Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Creating roundplus */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the number of lines given in the corresponding Load file.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what needs cleaning, build your companybase table first. You know your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, whenever you build anything else off the roundbase table, dirty keys will infect other areas of your database. Once the roundbase table is clean we rename it roundbasecore so that we know it is good to use for building other core tables.&lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between keys and entries: given an entry you can build a unique key, and given a (coname, statecode, datefirstinv) key you can find exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so 1 key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and undisclosed company that we built in the companybase1 table. This table has a 1:1 relationship between the key (coname, statecode, datefirstinv) and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on.&lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
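Since the same pair of COUNT queries is repeated for every core table, it can help to wrap the check in a small helper. This is a minimal sketch using Python's sqlite3 module on made-up rows; the key_counts function is hypothetical and not part of the build scripts, and on vcdb2 you would run the equivalent queries in psql.&lt;br /&gt;

```python
import sqlite3

def key_counts(con, table, key_columns):
    """Return (total rows, distinct keys) for a table and its candidate key."""
    # Table and column names are interpolated directly, so only pass
    # trusted identifiers here, never user input.
    cols = ", ".join(key_columns)
    total = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = con.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    return total, distinct

# Hypothetical rows: the two Beta LLC entries share a key but differ in url.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE companybase (coname TEXT, statecode TEXT, datefirstinv TEXT, url TEXT)"
)
con.executemany(
    "INSERT INTO companybase VALUES (?, ?, ?, ?)",
    [
        ("Acme Inc", "TX", "2010-05-01", "www.acme.example"),
        ("Beta LLC", "NY", "2015-08-13", "www.beta.example"),
        ("Beta LLC", "NY", "2015-08-13", "www.beta-other.example"),
    ],
)
total, distinct = key_counts(con, "companybase", ["coname", "statecode", "datefirstinv"])
print(total, distinct, "key is valid" if total == distinct else "key is NOT valid")
```

The key is valid only when the two numbers are equal, which is the same test as comparing the two counts by eye.&lt;br /&gt;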
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys, and this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries, which sum to 1888 rows in the ipos table, we look for another way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
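Keeping the MAX principalamt only removes duplicates when the largest amount is unique within a key. The sketch below, using Python's sqlite3 module and invented rows, shows one way duplicates can survive the join: two records that tie on principalamt. Whether ties explain the leftover duplicates in the real ipos table is an assumption; inspect ipoduplicates2 to confirm. The keep_max_principal helper is hypothetical.&lt;br /&gt;

```python
import sqlite3

def keep_max_principal(rows):
    """Rebuild ipoinclude and ipocore on sample rows.

    Returns (number of ipocore rows, issuers whose key is still duplicated).
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ipos (issuer TEXT, issuedate TEXT, statecode TEXT, principalamt REAL)"
    )
    con.executemany("INSERT INTO ipos VALUES (?, ?, ?, ?)", rows)
    # One row per key, carrying the largest principalamt seen for that key.
    con.execute(
        "CREATE TABLE ipoinclude AS "
        "SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt "
        "FROM ipos GROUP BY issuer, issuedate, statecode"
    )
    # Join back on key plus amount; tied amounts both match.
    con.execute(
        "CREATE TABLE ipocore AS SELECT ipos.* FROM ipos "
        "INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer "
        "AND ipos.issuedate = ipoinclude.issuedate "
        "AND ipos.statecode = ipoinclude.statecode "
        "AND ipos.principalamt = ipoinclude.principalamt"
    )
    core_rows = con.execute("SELECT COUNT(*) FROM ipocore").fetchone()[0]
    duplicated = con.execute(
        "SELECT issuer FROM ipocore "
        "GROUP BY issuer, issuedate, statecode HAVING COUNT(*) != 1 ORDER BY issuer"
    ).fetchall()
    con.close()
    return core_rows, [row[0] for row in duplicated]

# Hypothetical sample: Alpha Corp has a clear maximum, Gamma Fund has a tie.
ipo_sample = [
    ("Alpha Corp", "1999-01-05", "CA", 50.0),
    ("Alpha Corp", "1999-01-05", "CA", 80.0),
    ("Gamma Fund", "2001-07-09", "NY", 30.0),
    ("Gamma Fund", "2001-07-09", "NY", 30.0),
    ("Delta Inc", "2003-02-02", "MA", 12.5),
]
print(keep_max_principal(ipo_sample))
```

In this toy case the tied key has to be handled separately, which is the role the manual DELETE statements play above.&lt;br /&gt;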
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
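The surrogate id approach can be sketched on a toy table as follows. This uses Python's sqlite3 module, where an INTEGER PRIMARY KEY column stands in for PostgreSQL's SERIAL; the rows, the acquiror column and the dedupe_by_min_id helper are invented for illustration.&lt;br /&gt;

```python
import sqlite3

def dedupe_by_min_id(rows):
    """Keep, for each (targetname, targetstatecode, announceddate) key, the row with the lowest id."""
    con = sqlite3.connect(":memory:")
    # INTEGER PRIMARY KEY auto-assigns 1, 2, 3, ... in insertion order.
    con.execute(
        "CREATE TABLE mas (id INTEGER PRIMARY KEY, targetname TEXT, "
        "targetstatecode TEXT, announceddate TEXT, acquiror TEXT)"
    )
    con.executemany(
        "INSERT INTO mas (targetname, targetstatecode, announceddate, acquiror) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    result = con.execute(
        "SELECT mas.id, mas.targetname, mas.acquiror FROM mas "
        "INNER JOIN (SELECT MIN(id) AS id FROM mas "
        "GROUP BY targetname, targetstatecode, announceddate) AS masinclude "
        "ON mas.id = masinclude.id ORDER BY mas.id"
    ).fetchall()
    con.close()
    return result

# Hypothetical sample: the two Acme Inc rows share a key.
mas_sample = [
    ("Acme Inc", "TX", "2010-03-01", "Buyer A"),
    ("Acme Inc", "TX", "2010-03-01", "Buyer B"),
    ("Beta LLC", "NY", "2012-11-20", "Buyer C"),
]
print(dedupe_by_min_id(mas_sample))
```

Which duplicate survives depends only on load order, so this choice is arbitrary with respect to the content of the records.&lt;br /&gt;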
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week after the minimum announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week after the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate falls within 1 week after the minimum announceddate for the pair, the dateflag is 0.&lt;br /&gt;
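To make the window rule concrete, here is a small sketch of the same logic in plain Python. The dateflags function and the sample keys are hypothetical; the rule itself mirrors the CASE expression above (strictly more than 7 days after the minimum, or equal to the minimum).&lt;br /&gt;

```python
from datetime import date

def dateflags(keys):
    """keys: list of (targetname, targetstatecode, announceddate) tuples.

    Returns the same tuples with a dateflag appended.
    """
    # Minimum announceddate per (targetname, targetstatecode) pair.
    mindates = {}
    for name, state, announced in keys:
        pair = (name, state)
        mindates[pair] = min(announced, mindates.get(pair, announced))

    flagged = []
    for name, state, announced in keys:
        days_after_min = (announced - mindates[(name, state)]).days
        # 0 days: this is the minimum date itself, keep it.
        # 1 to 7 days after the minimum: inside the window, flag 0.
        # 8 or more days: outside the window, keep it.
        flag = 0 if days_after_min in range(1, 8) else 1
        flagged.append((name, state, announced, flag))
    return flagged

# Hypothetical sample keys.
mas_keys = [
    ("Acme Inc", "TX", date(2010, 3, 1)),
    ("Acme Inc", "TX", date(2010, 3, 5)),
    ("Acme Inc", "TX", date(2010, 3, 9)),
    ("Beta LLC", "NY", date(2012, 11, 20)),
]
for row in dateflags(mas_keys):
    print(row)
```

Only the 2010-03-05 key is flagged 0 here, since it falls 4 days after the minimum date for its pair.&lt;br /&gt;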
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag for rows where announceddate &amp;lt; datefirstinv and another exclude flag for rows where datefirstinv = announceddate. Also add a warning flag for rows where the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below returns the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate of the merger/acquisition is not before the datefirstinv and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same number of rows as the matcherwarningmindates table, but it doesn't (366 vs 364). To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. They should be 1:1 on themselves, meaning 1 key matches exactly one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will increase our row counts every time we join on these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
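The masterdate and the two validity flags can be sketched for a single portco key in plain Python. The exit_choice function is hypothetical and only mirrors the two CASE expressions above, with None standing in for SQL NULL.&lt;br /&gt;

```python
from datetime import date

def exit_choice(announceddate, ipoissuedate):
    """Return (masterdate, maskeyvalid, ipokeyvalid) for one portco key."""
    # masterdate is the earlier of the two dates, ignoring missing ones.
    candidates = [d for d in (announceddate, ipoissuedate) if d is not None]
    masterdate = min(candidates) if candidates else None
    # A key is valid only if it exists and carries the masterdate.
    maskeyvalid = int(announceddate is not None and announceddate == masterdate)
    ipokeyvalid = int(ipoissuedate is not None and ipoissuedate == masterdate)
    return masterdate, maskeyvalid, ipokeyvalid

# Hypothetical cases: both exits present, mas only, and no exit at all.
print(exit_choice(date(2012, 5, 1), date(2010, 9, 15)))
print(exit_choice(date(2012, 5, 1), None))
print(exit_choice(None, None))
```

Note that if the two dates are equal, both flags come out as 1, so such a portco would appear in both core tables.&lt;br /&gt;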
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the sum to the combined number of matches in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
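The same check can be run as a single query. This is only a sketch using the tables built above; with the counts reported above it should return 8610 + 2312 + 83 = 11005.&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore) +&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeyipokeycore) +&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
   WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS totalkeys;&lt;br /&gt;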
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage, make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the exit date and amount. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
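A further check, sketched here rather than taken from the original run, is to confirm that no company key appears more than once in the master table. If the upstream keys are clean, this query should return no rows.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;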
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add two exclude flags in Excel: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1, then you will want to exclude the row from your table; only rows where the issuedate &amp;gt; datefirstinv should be kept. If both flags are 0, then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
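The flags could also be computed in SQL instead of Excel. The sketch below assumes the raw Matcher output has first been loaded into a hypothetical table matcherportcoiporaw that has only the first seven columns above, and that excludeflag1 marks issuedate &amp;lt; datefirstinv while excludeflag2 marks issuedate = datefirstinv (the order of the two flags is an assumption).&lt;br /&gt;
 --matcherportcoiporaw is a hypothetical staging table, not part of the original build&lt;br /&gt;
 SELECT r.*,&lt;br /&gt;
 CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcoiporaw AS r;&lt;br /&gt;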
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 8 keys that are not distinct, accounting for 11 surplus rows. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
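To see which keys make the row count grow, one option is to list the portco keys that appear more than once in the Matcher output. This is a sketch using the matcherportcogeo table above; duplicate (coname, city, year) keys within companybasecore itself would inflate the join in the same way, so that table may be worth checking too.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;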
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions are on the [[Geocode.py]] page.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
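The CASE expression above picks the first non-null value among issuedate, announceddate and datelastinv + 5 years, so it could equivalently be written with COALESCE. A sketch of the same calculation:&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR') AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;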
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
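The tempbase query below reads from an allyears table that is not created in this section. A minimal sketch, assuming allyears is a single-column table of calendar years and that 1980 to 2017 covers the data (both the structure and the range are assumptions, so adjust them to your data):&lt;br /&gt;
 --hypothetical definition of allyears; the year range is an assumption&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT y AS year&lt;br /&gt;
 FROM generate_series(1980, 2017) AS y;&lt;br /&gt;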
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Creating round level outputs==&lt;br /&gt;
roundplus is used to build the two table outputs below at the round level.&lt;br /&gt;
 DROP TABLE roundleveloutput;&lt;br /&gt;
 CREATE TABLE roundleveloutput AS&lt;br /&gt;
 SELECT city, statecode, roundyear as year, &lt;br /&gt;
 sum(roundamtm*seedflag) AS seedamtm,&lt;br /&gt;
 sum(roundamtm*earlyflag) AS earlyamtm,&lt;br /&gt;
 sum(roundamtm*laterflag) AS lateramtm,&lt;br /&gt;
 sum(roundamtm*growthflag) AS selamtm,&lt;br /&gt;
 sum(seedflag) AS numseeds,&lt;br /&gt;
 sum(earlyflag) AS numearly,&lt;br /&gt;
 sum(laterflag) AS numlater,&lt;br /&gt;
 sum(growthflag) AS numsel,&lt;br /&gt;
 sum(dealflag) AS numdeals&lt;br /&gt;
 FROM roundplus WHERE hadgrowthvc=1 GROUP BY city, statecode, roundyear ORDER BY city, statecode, roundyear;&lt;br /&gt;
 --22266  &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE roundleveloutput2;&lt;br /&gt;
 CREATE TABLE roundleveloutput2 AS &lt;br /&gt;
 SELECT roundleveloutput.*, numalive&lt;br /&gt;
 FROM roundleveloutput&lt;br /&gt;
 LEFT JOIN alivecount ON alivecount.city=roundleveloutput.city AND alivecount.statecode=roundleveloutput.statecode AND &lt;br /&gt;
 alivecount.year=roundleveloutput.year;&lt;br /&gt;
 --22266&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use (coname, rounddate) as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
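As with the other core tables, you can confirm that the key is now unique by comparing the total row count with the count of distinct keys. A sketch; the two counts should be equal.&lt;br /&gt;
 SELECT COUNT(*) FROM roundcore;&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT coname, rounddate FROM roundcore)a;&lt;br /&gt;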
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag is used to exclude financing for activities we are not interested in, which should therefore be left out of our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19541</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19541"/>
		<updated>2017-07-26T21:50:49Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* coleveloutput */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the line counts copied into your new tables match the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
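As an alternative to exporting the table, a sketch like the one below lists every column of the rows that share the duplicated key, so the differing field can be spotted directly in psql. It assumes only the key values shown in the output above:&lt;br /&gt;
 --show all columns for the duplicated key&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM companybase1&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = DATE '2015-08-13';&lt;br /&gt;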
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
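If you want the database itself to enforce this guarantee, one option is a unique constraint on the key. This is a sketch rather than a step from the original build. The constraint name companybasecore_key is made up for illustration, and note that PostgreSQL unique constraints treat NULLs as distinct, so rows with a NULL statecode are not protected:&lt;br /&gt;
 --fails with an error if any (coname, statecode, datefirstinv) key appears more than once&lt;br /&gt;
 ALTER TABLE companybasecore&lt;br /&gt;
 ADD CONSTRAINT companybasecore_key UNIQUE (coname, statecode, datefirstinv);&lt;br /&gt;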
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
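For comparison, the same kind of key deduplication can be sketched with PostgreSQL's DISTINCT ON, which keeps exactly one row per key. This is not the method used above: where two rows tie on principalamt it keeps one of them arbitrarily instead of dropping the key, and the table name ipocoresketch is hypothetical:&lt;br /&gt;
 DROP TABLE IF EXISTS ipocoresketch;&lt;br /&gt;
 CREATE TABLE ipocoresketch AS&lt;br /&gt;
 --the first row in each (issuer, issuedate, statecode) group is kept, so sort the largest principalamt first&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;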
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column to mas and kept the MIN id for each duplicate key.&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 --copy mas after adding the id so that mas1.id and mas.id are guaranteed to refer to the same rows&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
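Both counts can also be put side by side in one query. A minimal sketch using only the mascore table and the key columns from above (the output column names are illustrative):&lt;br /&gt;
 --the table is clean when totalrows equals distinctkeys&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM mascore) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a) AS distinctkeys;&lt;br /&gt;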
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the Matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the Matcher will still yield many errors, so we filter the mas keys further. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 7 days after the minimum announceddate for its targetname, targetstatecode pair, or when it is that minimum announceddate itself. If the announceddate falls within 7 days after the minimum announceddate for the targetname, targetstatecode pair (without being the minimum), the dateflag is 0.&lt;br /&gt;
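To see how the flag behaves, here is a small self-contained sketch. The four dates are invented for illustration, with 2010-01-01 playing the role of the minimum announceddate:&lt;br /&gt;
 SELECT d.announceddate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d.announceddate = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS d(announceddate);&lt;br /&gt;
 --2010-01-01 gives 1 (it is the minimum), 2010-01-05 and 2010-01-08 give 0, 2010-01-09 gives 1&lt;br /&gt;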
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
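The three flags can also be computed in SQL instead of Excel. The sketch below assumes a hypothetical table matcherportcomasraw that holds the Matcher output with the first seven columns of matcherportcomas and no flag columns. It also assumes, following the order in the text, that excludeflag1 marks announceddate &amp;lt; datefirstinv and excludeflag2 marks equal dates:&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;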
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the datefirstinv &amp;lt; announceddate for the merger/acquisition, so the announceddate neither precedes nor equals the datefirstinv.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The row count of the inner join should equal the row count of the matcherwarningmindates table, but it doesn't (366 vs. 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good, so let's use UNION ALL to combine the matcherportcomasinclude table with the rows of matcherportcomas that have all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma causes our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
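Because both flags compare against masterdate, a portcokey whose announceddate and ipoissuedate fall on the same day would get both flags set to 1 and would then appear in both of the key tables built next. A quick check for such ties, as a sketch over the table just created:&lt;br /&gt;
 --a result of 0 means every portcokey keeps at most one exit key&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;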
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the sum to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 8655 + 2350 (both sides equal 11005).&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
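The same reconciliation can be done in one query. This is a sketch using only tables built in this walkthrough, with illustrative output column names. The two columns should be equal, assuming no portcokey has its ipo and ma on the same date:&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchedtotal;&lt;br /&gt;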
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, where the company had an ipo or ma, the dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
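One more check that may be worth running is that the master table still has exactly one row per portcokey. A sketch, which should return no rows if the joins did not fan out:&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;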
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, exclude the entry from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
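The flag columns do not have to be built in Excel. As a rough sketch, assuming the unflagged Matcher output (the first seven columns above) had instead been loaded into a hypothetical staging table called matcherraw, the same three flags could be derived in SQL. The output table name, and which of excludeflag1 and excludeflag2 marks which condition, are assumptions made for illustration.&lt;br /&gt;
 --matcherraw and matcherportcoipoflagged are hypothetical names, not tables in vcdb2&lt;br /&gt;
 CREATE TABLE matcherportcoipoflagged AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2issuedate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2issuedate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherraw;&lt;br /&gt;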
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom, the matcherportcoipocore table is clean and ready to use.&lt;br /&gt;
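The min-date table, the include table and the union can also be collapsed into one step. The following is only a sketch of an alternative, not the query that built vcdb2: PostgreSQL's DISTINCT ON keeps the first row per portco key after sorting by issuedate, so each key appears once even if two matches share the same earliest issuedate. It should give the same keys provided the rows with warningflag = 0 are already unique per key. The table name matcherportcoipocorealt is made up for illustration.&lt;br /&gt;
 --alternative sketch: one row per portco key, earliest issuedate first&lt;br /&gt;
 CREATE TABLE matcherportcoipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;&lt;br /&gt;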
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see that most of them are complete duplicates, so let's create a table with just the distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43,662 rows versus 43,651 keys), spread across the 8 duplicated keys shown below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
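To see the full rows behind these three keys before deleting anything, a query along these lines could be used. It is a sketch that relies only on the geo1 columns already used on this page.&lt;br /&gt;
 --list every geo1 row whose key occurs more than once&lt;br /&gt;
 SELECT g.*&lt;br /&gt;
 FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
  GROUP BY city, coname, startyear&lt;br /&gt;
  HAVING COUNT(*) &amp;gt; 1) AS dup&lt;br /&gt;
 ON g.city = dup.city AND g.coname = dup.coname AND g.startyear = dup.startyear&lt;br /&gt;
 ORDER BY g.coname, g.city, g.startyear;&lt;br /&gt;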
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The geo key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table, as shown below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the Matcher you will notice there are a ton of matching errors, as usual. If you simply import the output into vcdb2 and join companybasecore to geocore through it, your tables will start to explode. Notice how the row count jumps from 44,740 to 45,018 in the join below.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
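The extra rows appear because some portco keys occur more than once in the Matcher output, so one companybasecore row joins to several matcher rows. A sketch for listing those repeated keys, using only the matcherportcogeo columns defined above:&lt;br /&gt;
 --portco keys that the Matcher output contains more than once&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;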
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables, plus the companybasekeyipokeycore and companybasekeymaskeycore key tables that link them. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
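A quick way to sanity-check the result is to count how each deaddate was derived. This is a sketch that mirrors the CASE logic above; the deadsource column and its labels are made up for illustration.&lt;br /&gt;
 --count companies by the rule that produced their deaddate&lt;br /&gt;
 SELECT&lt;br /&gt;
 CASE&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN 'ma'&lt;br /&gt;
 ELSE 'datelastinv + 5 years'&lt;br /&gt;
 END AS deadsource,&lt;br /&gt;
 COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY 1&lt;br /&gt;
 ORDER BY 1;&lt;br /&gt;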
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
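The tempbase query below joins against an allyears table that is not built anywhere on this page. A minimal sketch, assuming allyears is simply one integer column named year; the 1980 to 2017 range is a placeholder and should be set to the years your data covers.&lt;br /&gt;
 --hypothetical helper table: one row per calendar year (range is a placeholder)&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;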
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==Creating coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating roundplus==&lt;br /&gt;
 DROP TABLE roundplus;&lt;br /&gt;
 CREATE TABLE roundplus AS&lt;br /&gt;
 SELECT roundcore.*, seedflag, earlyflag, laterflag, growthflag, transactionflag, excludeflag,&lt;br /&gt;
 CASE WHEN roundcore.datefirstinv=roundcore.rounddate THEN 1::int ELSE 0::int END as dealflag,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc,&lt;br /&gt;
 extract(year from roundcore.rounddate) as roundyear,&lt;br /&gt;
 CASE WHEN rndamtdisck IS NOT NULL THEN rndamtdisck/1000 WHEN rndamtdisck IS NULL AND rndamtestk IS NOT NULL THEN rndamtestk/1000 ELSE &lt;br /&gt;
 NULL::real END as roundamtm&lt;br /&gt;
 FROM roundcore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=roundcore.coname AND SelFlagBase.statecode=roundcore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=roundcore.datefirstinv&lt;br /&gt;
 LEFT JOIN stageflags ON  stageflags.coname=roundcore.coname AND stageflags.statecode=roundcore.statecode AND &lt;br /&gt;
 stageflags.datefirstinv=roundcore.datefirstinv AND stageflags.rounddate=roundcore.rounddate;&lt;br /&gt;
 --143001&lt;br /&gt;
&lt;br /&gt;
 SELECT coname, rounddate FROM (SELECT coname, rounddate FROM roundplus)a&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 DELETE FROM roundplus WHERE coname = 'New York Digital Health LLC';&lt;br /&gt;
 --2&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use coname, rounddate as the key for this table. Exclude all keys that occur more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks rounds that financed activities we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that they can be dropped from our dataset. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Seed', 'Early Stage', 'Later Stage') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('Acq. for Expansion', 'Acquisition', 'Bridge Loan', 'Expansion', 'Pending Acq', 'Recap or Turnaround', 'Mezzanine') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 IN ('LBO', 'MBO', 'Open Market Purchase', 'PIPE', 'Secondary Buyout',&lt;br /&gt;
  'Other', 'VC Partnership', 'Secondary Purchase') THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19540</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19540"/>
		<updated>2017-07-26T21:16:57Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Creating Stage Flags Table */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line count given in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but 44,774 rows in the companybase table, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
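The two-count key check used throughout this page can also be written as a single query. A sketch using a PostgreSQL row constructor inside COUNT(DISTINCT ...); the column aliases are made up for illustration. The key is valid when both numbers are equal.&lt;br /&gt;
 --total rows and distinct keys side by side&lt;br /&gt;
 SELECT COUNT(*) AS allkeys,&lt;br /&gt;
 COUNT(DISTINCT (coname, statecode, datefirstinv)) AS distinctkeys&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;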
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
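Keys survive as duplicates here when more than one ipos row for the same key carries the maximum principalamt. A sketch for pulling the full ipos rows behind those keys, so you can see what each entry looks like:&lt;br /&gt;
 --full ipos rows for every key listed in ipoduplicates2&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoduplicates2 AS d ON ipos.issuer = d.issuer AND ipos.issuedate = d.issuedate AND ipos.statecode = d.statecode&lt;br /&gt;
 ORDER BY ipos.issuer, ipos.issuedate;&lt;br /&gt;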
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
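The manual DELETE statements above remove the affected keys from ipoinclude, so those keys never reach ipocore. As an alternative sketch only (not the method used to build vcdb2), PostgreSQL's DISTINCT ON can keep exactly one row per key, choosing the row with the largest principalamt. The table name ipocoredistinct is hypothetical, and because one tied row is kept for each key instead of the key being removed, the row count may not match the 9470 reported above.&lt;br /&gt;
 --hypothetical table name; sketch of an alternative, not the original build step&lt;br /&gt;
 CREATE TABLE ipocoredistinct AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;
When several rows tie on principalamt, DISTINCT ON keeps whichever sorts first, so add more ORDER BY columns if the choice matters.&lt;br /&gt;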
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need a clean table or you will get many errors in the matcher output file. Luckily the core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week of the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or when it is itself the minimum announceddate for that targetname, targetstatecode pair. If the announceddate is later than the minimum but no more than 1 week after it, the dateflag is 0.&lt;br /&gt;
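As a small worked example of the flag logic (the dates are made up for illustration and assume a minimum announceddate of 2010-01-01), the same CASE expression can be run against literal values:&lt;br /&gt;
 SELECT d AS announceddate,&lt;br /&gt;
 CASE WHEN d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS v(d);&lt;br /&gt;
 --2010-01-01 gives 1 (it is the minimum)&lt;br /&gt;
 --2010-01-05 and 2010-01-08 give 0 (within 7 days of the minimum)&lt;br /&gt;
 --2010-01-09 gives 1 (more than 7 days after the minimum)&lt;br /&gt;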
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel, add one exclude flag that is 1 if the announceddate &amp;lt; datefirstinv and a second exclude flag that is 1 if the announceddate = datefirstinv. Also add a warning flag that is 1 if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import the file back into your db by creating a matcher output table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
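The three Excel flags could also be computed in SQL. This is only a sketch: it assumes the unflagged matcher output was first loaded into a hypothetical table matcherportcomasraw holding the first seven columns of matcherportcomas.&lt;br /&gt;
 --matcherportcomasraw and matcherportcomasflagged are hypothetical names&lt;br /&gt;
 CREATE TABLE matcherportcomasflagged AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;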
You've imported 9,645 matches into your matcher table in vcdb2, but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate of the merger/acquisition is later than the datefirstinv, so it is neither earlier than nor equal to it.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same number of rows as the matcherwarningmindates table but it doesn't. So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
 file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have IPOs as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
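A side note on the masterdate CASE expression: in PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so LEAST(announceddate, ipoissuedate) on its own should give the same masterdate. A quick check with made-up dates:&lt;br /&gt;
 SELECT LEAST(DATE '2005-03-01', DATE '2004-06-15') AS bothpresent,&lt;br /&gt;
 LEAST(DATE '2005-03-01', NULL::date) AS onlymas,&lt;br /&gt;
 LEAST(NULL::date, NULL::date) AS neither;&lt;br /&gt;
 --bothpresent = 2004-06-15, onlymas = 2005-03-01, neither = NULL&lt;br /&gt;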
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
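To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. Each portcokey that matched both an ipokey and a maskey lost one of its two matches, which is why the overlap count is added back. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;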
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
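The same comparison can be done in a single query. This is a sketch that assumes all of the tables built above still exist; with the counts reported on this page both columns should come out to 11005.&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
 + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;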
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name plus the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If an exclude flag is 1, you want to exclude that entry from your table. If both flags are 0 (i.e. the issuedate &amp;gt; datefirstinv), you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread across 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
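To confirm the arithmetic, the surplus rows can be counted directly. This is a sketch run against geo1 as it stands at this point; with the duplicate counts listed above it should return 11, which is 43662 - 43651.&lt;br /&gt;
 SELECT SUM(cnt - 1) AS surplusrows&lt;br /&gt;
 FROM (SELECT COUNT(*) AS cnt FROM geo1&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1)a;&lt;br /&gt;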
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
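To see which keys cause the fan-out, one option (a sketch, not part of the original build) is to list the portco keys that appear more than once in matcherportcogeo. Because the geocore key was verified as distinct earlier, repeated portco keys in the matcher output are the likely source of the extra rows.&lt;br /&gt;
 --portco keys that appear on more than one matcher row&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY nummatches DESC;&lt;br /&gt;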
Okay, so we need to fix this. Luckily I already had this data in another database, so I copied it out and imported it into vcdb2. The raw data can be found in a text file called geolookupold.txt in the folder on the Z drive.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
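A quick follow-up check, offered as a sketch rather than a step from the original build, is to count how many companies came through the LEFT JOIN without coordinates. These are the candidates for an address lookup.&lt;br /&gt;
 --companies with no coordinates after the join&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasegeomaster&lt;br /&gt;
 WHERE latitude IS NULL OR longitude IS NULL;&lt;br /&gt;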
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for how to do this.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have the companybasecore, ipocore and mascore tables, then calculate the deaddate: the IPO issuedate if there is one, otherwise the merger/acquisition announceddate, and if there is no exit, datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
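To sanity check the CASE logic, a sketch like the one below tallies how each deaddate was derived. Note that the CASE gives the IPO issuedate priority when a company has both an issuedate and an announceddate. The exittype labels are illustrative names, not columns in the database.&lt;br /&gt;
 SELECT&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN issuedate IS NOT NULL THEN 'ipo'&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN 'ma'&lt;br /&gt;
  ELSE 'noexit'&lt;br /&gt;
 END AS exittype, COUNT(*)&lt;br /&gt;
 FROM deaddate1&lt;br /&gt;
 GROUP BY exittype;&lt;br /&gt;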
Run the queries below to build the dead/alive tables and, for each city, statecode and year, the number of companies alive (from the year of datefirstinv through the deadyear). &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
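The tempbase query assumes an allyears table with a single integer column named year, which is not defined in the queries shown here and must exist before tempbase is built. A minimal sketch, assuming the range should simply span the years found in deadalive1, is:&lt;br /&gt;
 --hypothetical construction of allyears; adjust the range as needed&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT g.year&lt;br /&gt;
 FROM generate_series(&lt;br /&gt;
  (SELECT MIN(aliveyear)::int FROM deadalive1),&lt;br /&gt;
  (SELECT MAX(deadyear)::int FROM deadalive1)&lt;br /&gt;
 ) AS g(year);&lt;br /&gt;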
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company and geo details plus the exit information expressed as deaddate, aliveyear and deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Cleaning round table==&lt;br /&gt;
Use (coname, rounddate) as the key for this table and exclude every key that occurs more than once.&lt;br /&gt;
 CREATE TABLE roundexclude AS&lt;br /&gt;
 SELECT * FROM (&lt;br /&gt;
 SELECT coname, rounddate FROM round) t&lt;br /&gt;
 GROUP BY coname, rounddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --154&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE roundcore AS&lt;br /&gt;
 SELECT * FROM round&lt;br /&gt;
 WHERE NOT EXISTS (SELECT * FROM roundexclude AS re WHERE re.coname = round.coname AND re.rounddate = round.rounddate); &lt;br /&gt;
 --143000&lt;br /&gt;
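The roundcore key can be verified in one pass. This is a sketch using a row constructor inside COUNT(DISTINCT ...), which PostgreSQL accepts. The two numbers should be equal if (coname, rounddate) is a valid key.&lt;br /&gt;
 SELECT COUNT(*) AS numrows, COUNT(DISTINCT (coname, rounddate)) AS numkeys&lt;br /&gt;
 FROM roundcore;&lt;br /&gt;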
&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine whether a company received seed, early or later stage financing. The growthflag is '1' if any of the seed, early or later flags is '1'. The exclude flag marks financing for activities we are not interested in, such as 'Open Market Purchase' and 'PIPE', so that it can be excluded from our dataset. The table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19539</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19539"/>
		<updated>2017-07-26T20:34:59Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Build dead/alive flags */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into each new table matches the line count noted in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the fixes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build off roundbase will inherit the dirty keys. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so at least one key matches more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
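Before running a manual DELETE like this, it can help to preview the rows it will hit and to work inside a transaction so that the change can be undone. A sketch, assuming a psql session on vcdb2:&lt;br /&gt;
 BEGIN;&lt;br /&gt;
 --preview the rows that share the duplicated key&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, url&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD');&lt;br /&gt;
 --run the DELETE here and check the reported row count&lt;br /&gt;
 --then finish with COMMIT; to keep the change, or ROLLBACK; to undo it&lt;br /&gt;
 ROLLBACK;&lt;br /&gt;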
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique, so we must remove the duplicate keys first. You will need to try different methods to isolate them. This is where you can be creative. I started by finding the duplicates based on issuer, issuedate and statecode, which is our key. Have a look in TextPad/Excel for ways to filter these keys. We would like to keep as much information as possible, so rather than excluding all of these entries (1,888 rows in the ipos table), look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check whether this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
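Any key that still appears in ipoduplicates2 has more than one ipos row sharing the same maximum principalamt, since that is the only way a key can come through the join twice. To inspect those rows side by side, a sketch:&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoduplicates2 AS d ON ipos.issuer = d.issuer AND ipos.issuedate = d.issuedate AND ipos.statecode = d.statecode&lt;br /&gt;
 ORDER BY ipos.issuer, ipos.issuedate, ipos.principalamt DESC;&lt;br /&gt;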
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
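An alternative that is not used on this page is PostgreSQL's DISTINCT ON, which keeps exactly one row per key in a single query. Unlike the approach above, it breaks ties on principalamt arbitrarily instead of dropping the tied keys, so its result will differ from ipocore. Treat it as a sketch of a different technique. The table name ipocorealt is hypothetical.&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;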
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I added an id column to mas and took the MIN id for each duplicate key. Add the id to mas before copying it to mas1 so that both tables carry the same ids.&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we will filter the mas keys some more. First, for each (targetname, targetstatecode) pair, remove the mas keys whose announceddate falls within a week after the pair's earliest announceddate. Keep the key that has the minimum announceddate and discard the later one. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than 1 week later than the minimum announceddate for that targetname, targetstatecode pair, or when it is the minimum announceddate itself. If the announceddate is later than the minimum but by no more than 1 week, the dateflag is 0.    &lt;br /&gt;
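As a worked example with made-up dates, suppose the minimum announceddate for a pair is 2010-01-01. The sketch below applies the same CASE expression to four candidate dates:&lt;br /&gt;
 SELECT v.d AS announceddate,&lt;br /&gt;
 CASE WHEN v.d - INTERVAL '7 day' &amp;gt; DATE '2010-01-01' OR v.d = DATE '2010-01-01' THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES (DATE '2010-01-01'), (DATE '2010-01-05'), (DATE '2010-01-08'), (DATE '2010-01-09')) AS v(d);&lt;br /&gt;
 --2010-01-01 gives 1 (the minimum itself)&lt;br /&gt;
 --2010-01-05 gives 0 (within a week of the minimum)&lt;br /&gt;
 --2010-01-08 gives 0 (exactly seven days later, not strictly more)&lt;br /&gt;
 --2010-01-09 gives 1 (more than a week later)&lt;br /&gt;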
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below counts only the &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the announceddate &amp;gt; datefirstinv for a merger/acquisition (the deal is announced after the first investment) and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries, use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
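If you would rather inspect the duplicates in psql than in Excel, a query along these lines lists every candidate row for the keys that appear more than once. It is a sketch and not part of the original build.&lt;br /&gt;
 --sketch: show the full rows behind each duplicated portco key&lt;br /&gt;
 SELECT m.* FROM matcherportcomasinclude AS m&lt;br /&gt;
 JOIN (SELECT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
   FROM matcherportcomasinclude&lt;br /&gt;
   GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
   HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON m.file1coname = d.file1coname AND m.file1statecode = d.file1statecode AND m.file1datefirstinv = d.file1datefirstinv&lt;br /&gt;
 ORDER BY m.file1coname, m.file2targetname;&lt;br /&gt;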
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to both an ipokey and a maskey. We'll write a query in SQL to take the ipokey or maskey with the earliest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
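A side note on the CASE expression above: in PostgreSQL, LEAST ignores NULL arguments and only returns NULL when all of its arguments are NULL, so the masterdate should be the same as a bare LEAST(announceddate, ipoissuedate). The check below is a sketch to confirm that on your own table. It should return 0.&lt;br /&gt;
 --sketch: count rows where the CASE result differs from plain LEAST&lt;br /&gt;
 SELECT COUNT(*) FROM companybasekeysaddmaskeyaddipokeysmindate&lt;br /&gt;
 WHERE masterdate IS DISTINCT FROM LEAST(announceddate, ipoissuedate);&lt;br /&gt;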
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total with the combined number of rows in your matcherportcomascore and matcherportcoipocore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655 (both sides are 11005).&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
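The same comparison can be done in one query. This is a sketch that restates the arithmetic above, using the two key core tables on one side and the matcher core tables (the ones with 8655 and 2350 rows) on the other. The column aliases keptplusboth and matchertotal are made up for illustration.&lt;br /&gt;
 --sketch: both columns should show the same number&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusboth,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;
The two columns will not agree if a portco has an ipo and an ma on the same date, because then both validity flags are 1. A quick way to look for such ties:&lt;br /&gt;
 SELECT COUNT(*) FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;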
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and the dates and amounts if the company had an ipo or an ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
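One more check that may be worth running (a sketch, not from the original build): the master table should still be 1:1 on the portco key, so the distinct key count should equal the row count of 44740.&lt;br /&gt;
 --sketch: distinct portco keys in the master table&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;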
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If an exclude flag is 1, then you want to exclude that row from your table. If both flags are 0, then you want to keep the row, i.e. the issuedate &amp;gt; datefirstinv. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
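As a sanity check on the Excel flags, a sketch like the one below can be run right after the import, assuming the flags were built as described above. It counts rows that both exclude flags mark as keepable even though the issuedate is not after the datefirstinv, so it should return 0.&lt;br /&gt;
 --assumption: a row with both exclude flags = 0 must have issuedate after datefirstinv&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0&lt;br /&gt;
 AND file2issuedate &amp;lt;= file1datefirstinv;&lt;br /&gt;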
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
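As with the mas matches, ties on the minimum issuedate would show up as duplicate portco keys in the include table. The row count above (37) equals the number of keys, so there are none here, but on a fresh Matcher run a sketch like this would list any offenders:&lt;br /&gt;
 --sketch: portco keys that appear more than once in the include table&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, COUNT(*)&lt;br /&gt;
 FROM matcherportcoipoinclude&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;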
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows (43662 - 43651), spread over the 8 keys below that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
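To see the full rows behind those 3 keys, including the coordinates, a sketch like the following can be used. It only reads from geo1 and changes nothing.&lt;br /&gt;
 --sketch: full geo1 rows for every key that still appears more than once&lt;br /&gt;
 SELECT g.* FROM geo1 AS g&lt;br /&gt;
 JOIN (SELECT city, coname, startyear FROM geo1&lt;br /&gt;
   GROUP BY city, coname, startyear&lt;br /&gt;
   HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON g.city = d.city AND g.coname = d.coname AND g.startyear = d.startyear&lt;br /&gt;
 ORDER BY g.coname;&lt;br /&gt;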
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The geo key is (coname, city, startyear), so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
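To see where the extra rows come from, a sketch like the one below counts how often each portco key appears in the matcher output. Any key listed more than once fans out in the join. Duplicate (coname, city, year) keys inside companybasecore would have the same effect, since that table is only verified to be unique on (coname, statecode, datefirstinv).&lt;br /&gt;
 --sketch: portco keys that match more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1&lt;br /&gt;
 ORDER BY COUNT(*) DESC;&lt;br /&gt;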
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to use the script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
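The CASE expression above picks the first non-NULL value out of issuedate, announceddate and datelastinv + 5 years, which is what COALESCE does. As a sketch (not part of the original build), the check below should return 0 on deaddate1:&lt;br /&gt;
 --sketch: count rows where deaddate differs from the COALESCE form&lt;br /&gt;
 SELECT COUNT(*) FROM deaddate1&lt;br /&gt;
 WHERE deaddate IS DISTINCT FROM COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 years');&lt;br /&gt;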
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
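If you do not already have the allyears table used in the join above, and assuming it is just a one-column table with one row per calendar year, it could be built with generate_series as sketched below. The table name allyears_sketch and the year range are placeholders, not the values used for the original build.&lt;br /&gt;
 --hypothetical: pick a range that covers your datefirstinv and deaddate years&lt;br /&gt;
 CREATE TABLE allyears_sketch AS&lt;br /&gt;
 SELECT generate_series(1980, 2020) AS year;&lt;br /&gt;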
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
==coleveloutput==&lt;br /&gt;
One of the output tables required by the other researchers is the coleveloutput table. It contains company, geo and ipo/ma details in the form of aliveyear, deadyear. Here's how you build it:&lt;br /&gt;
 DROP TABLE SelFlagBase;&lt;br /&gt;
 CREATE TABLE SelFlagBase AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv from stageflags where growthflag=1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasecore2;&lt;br /&gt;
 CREATE TABLE companybasecore2 AS&lt;br /&gt;
 SELECT companybasecore.*,&lt;br /&gt;
 CASE WHEN SELFlagbase.coname IS NOT NULL THEN 1::int ELSE 0::int END AS hadgrowthvc&lt;br /&gt;
 FROM companybasecore&lt;br /&gt;
 LEFT JOIN SelFlagBase ON SelFlagBase.coname=companybasecore.coname AND SelFlagBase.statecode=companybasecore.statecode AND &lt;br /&gt;
 SelFlagBase.datefirstinv=companybasecore.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore2 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
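Note that SelFlagBase has 32597 keys but only 32575 companies end up with hadgrowthvc = 1. The likely explanation is that some keys in stageflags have no matching row in companybasecore. The sketch below lists them so you can check. Keys with a NULL statecode will also show up here because they can never satisfy the join.&lt;br /&gt;
 --sketch: growth keys from stageflags with no match in companybasecore&lt;br /&gt;
 SELECT s.coname, s.statecode, s.datefirstinv&lt;br /&gt;
 FROM SelFlagBase AS s&lt;br /&gt;
 LEFT JOIN companybasecore AS c ON s.coname = c.coname AND s.statecode = c.statecode AND s.datefirstinv = c.datefirstinv&lt;br /&gt;
 WHERE c.coname IS NULL;&lt;br /&gt;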
&lt;br /&gt;
 DROP TABLE coleveloutput;&lt;br /&gt;
 CREATE TABLE coleveloutput AS&lt;br /&gt;
 SELECT companybasecore2.coname, companybasecore2.statecode, companybasecore2.datefirstinv, companybasecore2.city, &lt;br /&gt;
 companybasecore2.addr1, companybasecore2.addr2, companybasecore2.zip, g.latitude, g.longitude, d.deaddate, d.aliveyear, d.deadyear&lt;br /&gt;
 FROM companybasecore2  &lt;br /&gt;
 LEFT JOIN deadalive1 AS d ON d.coname=companybasecore2.coname AND d.statecode=companybasecore2.statecode AND &lt;br /&gt;
 d.datefirstinv=companybasecore2.datefirstinv&lt;br /&gt;
 LEFT JOIN geoimport AS g ON g.coname = companybasecore2.coname AND g.statecode = companybasecore2.statecode AND g.datefirstinv = &lt;br /&gt;
 companybasecore2.datefirstinv&lt;br /&gt;
 WHERE hadgrowthvc=1;&lt;br /&gt;
 --32575&lt;br /&gt;
 \COPY coleveloutput TO 'coleveloutput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks financing for activities we are not interested in, which should be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19538</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19538"/>
		<updated>2017-07-26T19:51:35Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Build dead/alive flags */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
Database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of rows copied into each new table matches the count in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply these changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise dirty keys will infect anything else you build off roundbase. Once the roundbase table is clean we rename it roundbasecore so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique, meaning there is a 1:1 relationship between keys and entries: each entry yields a unique key, and each (coname, statecode, datefirstinv) key identifies exactly 1 entry in the companybase table.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,774 rows in the companybase table but only 44,771 distinct keys, so a key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies located outside the US. So let's build flags for these two cases and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look on textpad for something unique about one of the entries on New York Digital Health LLC that we can use to manually delete it from the companybase1 table. Turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The WHERE clause takes the place of the 2 flags on nationcode and undisclosed company that we built in the companybase1 table. This table has a guaranteed 1:1 relationship between the (coname, statecode, datefirstinv) key and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
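As a minimal sketch that is not part of the original process, the two key checks above can be combined into a single query. The column aliases totalrows and distinctkeys are made up for illustration; the key is valid when both values are equal.&lt;br /&gt;
 --sketch: compare total rows with distinct keys in one query&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasecore) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a) AS distinctkeys;&lt;br /&gt;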
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys; this is where you can be creative. I started by finding the duplicates based on our key (issuer, issuedate, statecode). Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible, so rather than excluding all these entries, which sum to 1888 rows in the ipos table, we look for some other way to filter out records and still end up with distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check to see if this core table has unique keys so that 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
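An alternative sketch, assuming PostgreSQL's DISTINCT ON: it keeps exactly one row per key (the one with the largest principalamt), so ties on principalamt cannot produce duplicate keys. This differs from the process above, which drops the tied keys entirely, and which of the tied rows survives here is arbitrary. The table name ipocorealt is hypothetical.&lt;br /&gt;
 --sketch: one row per (issuer, issuedate, statecode), largest principalamt first&lt;br /&gt;
 CREATE TABLE ipocorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;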
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas --take the ids from mas itself, since mascore joins on mas.id&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 SELECT mas.*&lt;br /&gt;
 FROM mas INNER JOIN masinclude ON mas.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count of mascore matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process. However, running the matcher will still yield many errors, so we will filter the mas keys some more. First, for each targetname, targetstatecode pair, remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within a week after the earliest announceddate, keeping the key with the minimum announceddate. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week later than the minimum announceddate, or is itself the minimum announceddate, for that targetname, targetstatecode pair. If the announceddate is later than the minimum by 1 week or less, the dateflag is 0.    &lt;br /&gt;
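A small self-contained illustration of the dateflag rule, using made-up dates rather than rows from mascore: the first row is the minimum date itself, the second is 4 days after it and the third is 19 days after it.&lt;br /&gt;
 --sketch with made-up dates; same CASE expression as in maskeysdatewindow&lt;br /&gt;
 SELECT d.announceddate, d.minanndate,&lt;br /&gt;
 CASE WHEN d.announceddate - INTERVAL '7 day' &amp;gt; d.minanndate OR d.announceddate = d.minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-01', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-05', DATE '2010-01-01'),&lt;br /&gt;
  (DATE '2010-01-20', DATE '2010-01-01')&lt;br /&gt;
 ) AS d(announceddate, minanndate)&lt;br /&gt;
 ORDER BY d.announceddate;&lt;br /&gt;
 --dateflag is 1, 0, 1&lt;br /&gt;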
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
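The flags were added in Excel, but as a rough sketch they could also be derived in SQL from the imported columns. Which condition maps to excludeflag1 versus excludeflag2 is an assumption based on the order in the description above, and the column names ending in calc are hypothetical.&lt;br /&gt;
 --sketch: recompute the Excel flags next to the imported ones for comparison&lt;br /&gt;
 SELECT m.*,&lt;br /&gt;
 CASE WHEN m.file2announceddate &amp;lt; m.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1calc,&lt;br /&gt;
 CASE WHEN m.file2announceddate = m.file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2calc,&lt;br /&gt;
 CASE WHEN m.warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflagcalc&lt;br /&gt;
 FROM matcherportcomas AS m;&lt;br /&gt;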
You've imported 9,645 matches into your matcher table in vcdb2 but if you run the query below you will get the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings and where the announceddate of the merger/acquisition is strictly later than the datefirstinv.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries try to save as many of the matches with warnings as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should equal the number of rows in the matcherwarningmindates table but it doesn't (366 vs 364). So to find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
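As an optional extra check that is not part of the original process, the sketch below looks at the other side of the match and lists any mas key that is matched to more than one portco key. If it returns no rows, the match is also unique on the mas side.&lt;br /&gt;
 --sketch: mas keys that appear in more than one match&lt;br /&gt;
 SELECT file2targetname, file2targetstatecode, file2announceddate, COUNT(*)&lt;br /&gt;
 FROM matcherportcomascore&lt;br /&gt;
 GROUP BY file2targetname, file2targetstatecode, file2announceddate&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;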
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore and ipocore are clean core tables. Each should be 1:1 on itself, meaning 1 key matches exactly one row in its table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]], [[#Cleaning mas table|Cleaning mas table]] and [[#Cleaning ipos table|Cleaning ipos table]] for instructions.&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company that has both an ipo and ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
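In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression above gives the same result as a bare LEAST(announceddate, ipoissuedate). A small illustration with made-up dates (the column n only fixes the row order):&lt;br /&gt;
 --sketch with made-up dates: both present, mas only, ipo only, neither&lt;br /&gt;
 SELECT d.n, LEAST(d.announceddate, d.ipoissuedate) AS masterdate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (1, DATE '2012-03-01', DATE '2011-06-15'),&lt;br /&gt;
  (2, DATE '2012-03-01', NULL::date),&lt;br /&gt;
  (3, NULL::date, DATE '2011-06-15'),&lt;br /&gt;
  (4, NULL::date, NULL::date)&lt;br /&gt;
 ) AS d(n, announceddate, ipoissuedate)&lt;br /&gt;
 ORDER BY d.n;&lt;br /&gt;
 --masterdate: 2011-06-15, 2012-03-01, 2011-06-15, NULL&lt;br /&gt;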
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total to the number of keys in your companybasemaskeycore and companybaseipocore tables. The 83 portcos that matched both an ipo and an ma kept only their earlier exit, so they account for the difference. In my case I get 8610 + 2312 + 83 = 2350 + 8655 = 11005.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
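The arithmetic above assumes that no portco has an ipo and an ma on the same date, because such a row would get both maskeyvalid = 1 and ipokeyvalid = 1 and would land in both core tables. A quick check, sketched here with the flag table built earlier, is to count those rows. A result of 0 means there are no ties.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;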
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage, make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]].&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name along with the dates and amounts of its ipo or ma, if it had one. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], only the exit deal with the minimum date is kept, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
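One more check that may be worth running, sketched in the same style as the duplicate-key queries used elsewhere on this page, is to group the master table by the portco key and look for keys that appear more than once. If the joins are clean this should return no rows.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;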
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. In Excel, add two exclude flags: one that is 1 where the issuedate &amp;lt; datefirstinv and one that is 1 where the issuedate = datefirstinv. If either exclude flag is 1 then you will want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you will want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
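An alternative sketch, not the method used above, is PostgreSQL's DISTINCT ON, which keeps exactly one row per portcokey (the one with the earliest issuedate) in a single step. Unlike the join on the minimum date, it still returns a single row when two matches for the same portcokey share the same issuedate. The table name matcherportcoipoinclude2 is made up for illustration.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude2 AS&lt;br /&gt;
 SELECT DISTINCT ON (file1coname, file1statecode, file1datefirstinv) *&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 ORDER BY file1coname, file1statecode, file1datefirstinv, file2issuedate;&lt;br /&gt;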
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 more rows than distinct keys (43662 rows versus 43651 keys), spread across the 8 duplicated keys shown below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through the [[The Matcher]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv from the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
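To see where the extra rows come from, one option is to list the portco keys that appear more than once in the matcher output. This is a sketch using the matcherportcogeo columns defined above. Each extra copy of a key adds a row for the corresponding companybasecore entry in the join.&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;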
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
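A quick follow-up check, sketched with the table just created, is to count the portcos that did not pick up coordinates from geoimport. These are the companies whose addresses may still need to be geocoded.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasegeomaster&lt;br /&gt;
 WHERE latitude IS NULL OR longitude IS NULL;&lt;br /&gt;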
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to do this.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
You will need to run the queries below to build out a master table that has dead and alive flags and the company counts for each year in the database by datefirstinv. &lt;br /&gt;
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND &lt;br /&gt;
 deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = &lt;br /&gt;
 flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 as&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) as aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
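The tempbase query below reads from an allyears table that is not created in this section. A minimal sketch of one way to build it, assuming it only needs a single integer column named year that covers the span of years found in deadalive1:&lt;br /&gt;
 --assumption: allyears holds one row per calendar year&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(bounds.firstyear, bounds.lastyear) AS year&lt;br /&gt;
 FROM (SELECT MIN(aliveyear)::int AS firstyear, MAX(deadyear)::int AS lastyear FROM deadalive1) AS bounds;&lt;br /&gt;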
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase As&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) as numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER by count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The exclude flag marks financing for activities we are not interested in, so that those entries can be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the exclude flag filters out. The stageflags table is built off the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19537</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19537"/>
		<updated>2017-07-26T19:35:24Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Build dead/alive flags */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2. It is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the sql scripts and data txt files are all located in the folder. Check that the number of lines copied into your new tables matches the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table which is used to build the core company and round tables contains some data that we would like to remove like Undisclosed companies and duplicate entries. In order to find what to clean, build your companybase table first. You know your companybase table is clean once it contains a 1:1 relationship between keys and entries. We will then apply these changes to the roundbase table because any cleaning changes made downstream should be incorporated upstream into the base table. Otherwise when you build anything else off your roundbase table, dirty keys will infect the other areas of your database. Once the roundbase table is clean we will rename it roundbasecore so that we know it is clean and good to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So 1 key can match to more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and have a look in TextPad for something unique about one of the entries for New York Digital Health LLC that we can use to manually delete it from the companybase1 table. It turns out the url is different so we'll use that.&lt;br /&gt;
Manually delete this record from the roundbase table using the below command. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
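To look at the full rows behind each duplicate key without leaving psql, you could join ipoduplicates back to ipos. This is a sketch rather than part of the original build; note that keys with a NULL statecode will not join with = and would need to be inspected separately.&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode, ipos.principalamt&lt;br /&gt;
 FROM ipos INNER JOIN ipoduplicates AS d ON ipos.issuer = d.issuer AND ipos.issuedate = d.issuedate AND &lt;br /&gt;
 ipos.statecode = d.statecode&lt;br /&gt;
 ORDER BY ipos.issuer, ipos.issuedate, ipos.principalamt;&lt;br /&gt;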
In the file you will notice that many duplicate keys have rows with different principalamt values. Let's keep the row with the MAX principalamt for each key and throw out the rows for the same key that have a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create an ipocore table. Then check that this core table has unique keys, so that 1 key matches exactly 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
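If the list of bad keys is long, a single DELETE ... USING statement driven by ipoduplicates2 could stand in for the hand-written statements. This is a sketch that assumes ipoduplicates2 has been built as above; like the manual statements, it matches on issuer and statecode only, so it removes every issuedate for those issuers.&lt;br /&gt;
 DELETE FROM ipoinclude&lt;br /&gt;
 USING ipoduplicates2 AS d&lt;br /&gt;
 WHERE ipoinclude.issuer = d.issuer AND ipoinclude.statecode = d.statecode;&lt;br /&gt;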
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter against with mas, so I inserted an id column in mas and took the MIN id for duplicate keys.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 --join on mas1 because masinclude takes its ids from mas1&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id; &lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
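PostgreSQL's DISTINCT ON offers a shorter way to keep one row per key. The sketch below is an alternative to the masinclude join, not the query used to build vcdb2; the table name mascorealt is hypothetical. Ordering by id last keeps the row with the lowest id for each key, which mirrors the MIN(id) rule.&lt;br /&gt;
 CREATE TABLE mascorealt AS&lt;br /&gt;
 SELECT DISTINCT ON (targetname, targetstatecode, announceddate) *&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 ORDER BY targetname, targetstatecode, announceddate, id;&lt;br /&gt;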
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we filter the mas keys some more. The first step is to remove mas keys (targetname, announceddate, targetstatecode) whose announceddate falls within one week after the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later ones. Shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the key's announceddate is more than one week after the minimum announceddate for its (targetname, targetstatecode) pair, or when it is that minimum announceddate. If the announceddate is later than the minimum but no more than one week later, the dateflag is 0.&lt;br /&gt;
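As a quick illustration, the same CASE expression can be run on a few made-up dates (the dates are invented for this example; the minimum announceddate is 2010-01-04 in every row):&lt;br /&gt;
 SELECT announceddate, minanndate,&lt;br /&gt;
 CASE WHEN announceddate - INTERVAL '7 day' &amp;gt; minanndate OR announceddate = minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-01-04', DATE '2010-01-04'), --the minimum itself: dateflag 1&lt;br /&gt;
  (DATE '2010-01-08', DATE '2010-01-04'), --4 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-11', DATE '2010-01-04'), --exactly 7 days later: dateflag 0&lt;br /&gt;
  (DATE '2010-01-12', DATE '2010-01-04')  --8 days later: dateflag 1&lt;br /&gt;
 ) AS t(announceddate, minanndate);&lt;br /&gt;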
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
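The three flags can also be computed in SQL instead of Excel. The sketch below assumes you first load the unflagged matcher output into a table holding only the first seven columns; the name matcherportcomasraw is hypothetical.&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN file2announceddate &amp;lt; file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag1,&lt;br /&gt;
 CASE WHEN file2announceddate = file1datefirstinv THEN 1::int ELSE 0::int END AS excludeflag2,&lt;br /&gt;
 CASE WHEN warning = 'Hall-Warning:Multiple' THEN 1::int ELSE 0::int END AS warningflag&lt;br /&gt;
 FROM matcherportcomasraw;&lt;br /&gt;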
You've imported 9,645 matches into your matcher table in vcdb2, but the query below gives the number of &amp;quot;good&amp;quot; matches. These are matches that do not contain warnings, where the datefirstinv &amp;lt; announceddate for a merger/acquisition and where the datefirstinv does not equal the announceddate.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). So the next few queries will try to save as many of the bad matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join result should have the same number of rows as the matcherwarningmindates table (364), but it has 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. 1 portco key from the companybase table should match with exactly 1 mas key from the mascore table. If you have more than 1:1 you will get errors in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create a companybasekeymaskeyipokeycore table that contains clean matches from companybasekey (portcokey) to ipo or mas.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
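In PostgreSQL, LEAST ignores NULL arguments and returns NULL only when every argument is NULL, so the CASE expression above should be equivalent to LEAST(announceddate, ipoissuedate). A small check on made-up dates (invented for illustration):&lt;br /&gt;
 SELECT a, b, LEAST(a, b) AS masterdate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2012-03-01', DATE '2011-06-15'), --both present: 2011-06-15&lt;br /&gt;
  (DATE '2012-03-01', NULL::date),        --only a: 2012-03-01&lt;br /&gt;
  (NULL::date, DATE '2011-06-15'),        --only b: 2011-06-15&lt;br /&gt;
  (NULL::date, NULL::date)                --neither: NULL&lt;br /&gt;
 ) AS t(a, b);&lt;br /&gt;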
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
Now create the companybasekeyipokeycore and companybasekeymaskeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below, and compare the total with the combined row count of your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.  &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
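The arithmetic can be checked in one statement. This sketch assumes all five tables from the earlier steps exist; the aliases keptplusoverlap and matchedtotal are made up. With the counts reported on this page both columns come to 11005.&lt;br /&gt;
 SELECT&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeymaskeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeyipokeycore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
  WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keptplusoverlap,&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcomascore) +&lt;br /&gt;
 (SELECT COUNT(*) FROM matcherportcoipocore) AS matchedtotal;&lt;br /&gt;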
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with mascore and ipocore through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name, plus the dates and amounts if the company had an ipo or ma. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]] the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
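One further check, sketched here as a suggestion rather than something from the original build, is to confirm that no portco key appears more than once in the master table. If the joins are clean the query should return zero rows.&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, COUNT(*)&lt;br /&gt;
 FROM companybaseipomasmaster&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;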
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]]&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If an exclude flag is 1 then you will want to exclude this entry from your table; a row is kept only when the issuedate &amp;gt; datefirstinv. If the flags are set to 0, then you will want to keep this row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.   &lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched twice.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 surplus rows, spread over the 8 duplicate keys shown below. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually inspect the rows for each of the 3 remaining duplicate keys (for example their latitude, longitude or endyear) and delete one row per key. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
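As an optional safeguard that is not part of the original build, a unique index on the key makes PostgreSQL reject any later insert that would reintroduce a duplicate key. A minimal sketch on a throwaway table; geodemo, its index name and its rows are made up for illustration. Note that a unique index does not catch rows with a NULL in a key column.&lt;br /&gt;
 CREATE TEMP TABLE geodemo (coname varchar(255), city varchar(100), startyear integer);&lt;br /&gt;
 INSERT INTO geodemo VALUES ('Acme Inc', 'Houston', 2010), ('Acme Inc', 'Austin', 2010);&lt;br /&gt;
 --index creation fails with a unique violation if any (coname, city, startyear) repeats&lt;br /&gt;
 CREATE UNIQUE INDEX geodemo_key ON geodemo (coname, city, startyear);&lt;br /&gt;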
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geo keys and companybase keys and run them through [[The Matcher]]. The key is (coname, city, startyear), so you'll need to extract the year from datefirstinv in the companybasecore table. See below. &lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv)&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are many matching errors, as usual. If you simply import the output into vcdb2 and join companybasecore to geocore through it, the result fans out: keys that matched more than once produce extra rows. Notice how the row count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
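To see why the join grows, it can help to list the portco keys that the matcher paired with more than one geo key. The following is a sketch on a made-up table, matchdemo, which borrows the column names of matcherportcogeo; the rows are invented.&lt;br /&gt;
 CREATE TEMP TABLE matchdemo (portcoconame varchar(255), portcocity varchar(100), portcostartyear integer, geoconame varchar(255), geocity varchar(100), geodatefirstyear integer);&lt;br /&gt;
 INSERT INTO matchdemo VALUES&lt;br /&gt;
  ('Acme Inc', 'Houston', 2010, 'Acme Inc', 'Houston', 2010),&lt;br /&gt;
  ('Acme Inc', 'Houston', 2010, 'Acme Incorporated', 'Houston', 2010),&lt;br /&gt;
  ('Beta LLC', 'Austin', 2012, 'Beta LLC', 'Austin', 2012);&lt;br /&gt;
 --each row returned is a portco key that adds (nummatches - 1) extra rows to the join&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*) AS nummatches&lt;br /&gt;
 FROM matchdemo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;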
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. Instructions are on the [[Geocode.py]] page.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.city, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN datelastinv + INTERVAL '5 YEAR'&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
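The CASE expression above takes issuedate first, then announceddate, and otherwise falls back to datelastinv plus five years. COALESCE expresses the same priority order more compactly. This is a sketch on made-up rows, not the query used to build vcdb2; the company names and dates are invented.&lt;br /&gt;
 SELECT coname, COALESCE(issuedate, announceddate, datelastinv + INTERVAL '5 YEAR') AS deaddate&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  ('IPO Co', DATE '2012-06-01', DATE '2010-03-15', NULL::date),&lt;br /&gt;
  ('Acquired Co', DATE '2011-01-10', NULL::date, DATE '2013-09-30'),&lt;br /&gt;
  ('No Exit Co', DATE '2009-05-20', NULL::date, NULL::date)&lt;br /&gt;
 ) AS d (coname, datelastinv, issuedate, announceddate);&lt;br /&gt;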
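You will need to run the queries below to pull the sel flag into the dead/alive data and count the companies alive in each city, state and year.&lt;br /&gt;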
 CREATE TABLE stageflagscore AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE WHEN seedflag = 1 OR earlyflag = 1 OR laterflag = 1 THEN 1::int ELSE 0::int&lt;br /&gt;
 END AS selflag&lt;br /&gt;
 FROM stageflags;&lt;br /&gt;
 --143347&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE selcos;&lt;br /&gt;
 CREATE TABLE selcos AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv, selflag&lt;br /&gt;
 FROM stageflagscore&lt;br /&gt;
 WHERE excludeflag = 0 AND selflag = 1;&lt;br /&gt;
 --32597&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE deadalive;&lt;br /&gt;
 CREATE TABLE deadalive AS&lt;br /&gt;
 SELECT deaddate1.*, sel.selflag&lt;br /&gt;
 FROM deaddate1 LEFT JOIN selcos AS sel ON deaddate1.coname = sel.coname AND deaddate1.statecode = sel.statecode AND deaddate1.datefirstinv = sel.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --match to sel flag&lt;br /&gt;
 DROP TABLE deadalivesel;&lt;br /&gt;
 CREATE TABLE deadalivesel AS&lt;br /&gt;
 SELECT da.*, flags.stage3, flags.seedflag, flags.earlyflag, flags.laterflag, flags.growthflag, flags.transactionflag, flags.excludeflag&lt;br /&gt;
 FROM deadalive AS da LEFT JOIN stageflags AS flags ON da.coname = flags.coname AND da.statecode = flags.statecode AND da.datefirstinv = flags.datefirstinv;&lt;br /&gt;
 --143310&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deadalive1 AS&lt;br /&gt;
 SELECT coname, city, statecode, datefirstinv, datelastinv, deaddate, extract(year from datefirstinv) AS aliveyear,&lt;br /&gt;
 extract(year from deaddate) AS deadyear&lt;br /&gt;
 FROM deadalive WHERE selflag=1;&lt;br /&gt;
 --32575&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE tempbase;&lt;br /&gt;
 CREATE TABLE tempbase AS&lt;br /&gt;
 SELECT DISTINCT year, coname, city, statecode&lt;br /&gt;
 FROM allyears&lt;br /&gt;
 JOIN deadalive1 ON year&amp;gt;=extract(year from datefirstinv) AND year&amp;lt;=deadyear;&lt;br /&gt;
 --239446&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE alivecount;&lt;br /&gt;
 CREATE TABLE alivecount AS&lt;br /&gt;
 SELECT city, statecode, year, count(coname) AS numalive FROM tempbase&lt;br /&gt;
 GROUP BY city, statecode, year ORDER BY count(coname) DESC;&lt;br /&gt;
 --42296&lt;br /&gt;
 \COPY alivecount TO 'alivecount.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
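The tempbase query assumes an allyears table with one row per calendar year in a column named year; its definition is not shown on this page. One way to build such a table is with generate_series. The 1980 to 2017 range is an assumption, so adjust it to the span of your data.&lt;br /&gt;
 DROP TABLE IF EXISTS allyears;&lt;br /&gt;
 CREATE TABLE allyears AS&lt;br /&gt;
 SELECT generate_series(1980, 2017) AS year;&lt;br /&gt;
 --38&lt;br /&gt;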
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags are used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag identifies financing for activities we are not interested in, such as 'Open Market Purchase' or 'PIPE', so that those entries can be excluded from our dataset. The stageflags table is built from the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19506</id>
		<title>VC Database Rebuild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=VC_Database_Rebuild&amp;diff=19506"/>
		<updated>2017-07-25T22:11:31Z</updated>

		<summary type="html">&lt;p&gt;AdrianS: /* Gathering geo data from company addresses */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Plan==&lt;br /&gt;
Rebuild roundbase, round, geo, ipos, mas from SDC data.  &lt;br /&gt;
Create companybase from roundbase&lt;br /&gt;
Create round from roundbase.&lt;br /&gt;
Build stageflags from round.&lt;br /&gt;
&lt;br /&gt;
Clean companybase by putting flags for Undisclosed Company, US location. Check if key (coname, statecode, datefirstinv) is valid. Remove duplicates manually/update command from roundbase. Check if round key is valid. Remove duplicates.&lt;br /&gt;
&lt;br /&gt;
Build statelookup tables and roundlookup tables.&lt;br /&gt;
&lt;br /&gt;
Clean firmbase tables. Clean ipo tables. Clean mas table.&lt;br /&gt;
&lt;br /&gt;
Run matcher on ipos, companybase. Matcher on mas, companybase. Fix duplicate matches.&lt;br /&gt;
&lt;br /&gt;
Join ipos and companybase. Check if count is valid. Fix match as required. Pull ipo key into companybase and companybase key into ipo table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join mas and companybase. Check if count is valid. Fix match as required. Pull mas key into companybase and companybase key into mas table first. Then join.&lt;br /&gt;
&lt;br /&gt;
Join ipocompanybase with macompanybase to get a table of portcos, ipos and mas.&lt;br /&gt;
&lt;br /&gt;
Calculate exit date based on ipo, ma, datelastinv + 5 years.&lt;br /&gt;
&lt;br /&gt;
Pull in sel flag into companybase and build dead or alive flag. &lt;br /&gt;
&lt;br /&gt;
Match geodata to companybase. Pull geokey into companybase table. Lookup addresses to get geo data as required using geo.py.&lt;br /&gt;
&lt;br /&gt;
Clean fundbase and check valid key (fundname, statecode, firstinvdate)&lt;br /&gt;
&lt;br /&gt;
Clean firmbase and check valid key (firmname, foundingdate)&lt;br /&gt;
&lt;br /&gt;
==Loading starting data into database==&lt;br /&gt;
The database is named vcdb2 and is located in /bulk/VentureCapitalData/SDCVCData. Launch it with psql vcdb2. Load the following tables by running the commands below. Make sure the SQL scripts and data txt files are all located in that folder. Check that the number of lines copied into each new table matches the line counts in the Load files.&lt;br /&gt;
&lt;br /&gt;
 \i LoadFunds.sql&lt;br /&gt;
 \i LoadIPOs.sql&lt;br /&gt;
 \i LoadRoundbase.sql&lt;br /&gt;
 \i LoadFirms.sql&lt;br /&gt;
 \i LoadGeoData.sql&lt;br /&gt;
 \i LoadLongDescription.sql&lt;br /&gt;
 \i LoadRound.sql&lt;br /&gt;
&lt;br /&gt;
==Cleaning Process==&lt;br /&gt;
The roundbase table, which is used to build the core company and round tables, contains some data that we would like to remove, such as Undisclosed companies and duplicate entries. To find what to clean, build your companybase table first. Your companybase table is clean once there is a 1:1 relationship between keys and entries. We then apply the same changes to the roundbase table, because any cleaning done downstream should be incorporated upstream into the base table. Otherwise, anything else you build off roundbase will inherit the dirty keys. Once the roundbase table is clean we rename it roundbasecore so that we know it is safe to use for building other core tables.  &lt;br /&gt;
&lt;br /&gt;
==Creating Base Tables==&lt;br /&gt;
Create the base tables, companybase and round, by running the following scripts. These are the initial tables you will need to clean and join in order to get the master tables.&lt;br /&gt;
 &lt;br /&gt;
 DROP TABLE companybase;&lt;br /&gt;
 CREATE TABLE companybase AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip &lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE round;&lt;br /&gt;
 CREATE TABLE round AS&lt;br /&gt;
 SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors&lt;br /&gt;
 FROM roundbase&lt;br /&gt;
 ORDER BY coname;&lt;br /&gt;
&lt;br /&gt;
==Cleaning the Companybase table==&lt;br /&gt;
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.&lt;br /&gt;
&lt;br /&gt;
So first check to see if the key is valid on the base data using the following 2 queries.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44774&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;&lt;br /&gt;
 --44771&lt;br /&gt;
You can see that the key is not unique because the counts don't match. There are 44,771 distinct keys but 44,774 rows in the companybase table, so one key can match more than one entry in the table.&lt;br /&gt;
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.&lt;br /&gt;
 DROP TABLE companybase1;&lt;br /&gt;
 CREATE TABLE companybase1 AS &lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN nationcode = 'US' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS alwaysusflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN coname = 'Undisclosed Company' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS undisclosedflag&lt;br /&gt;
 FROM companybase;&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44771&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;&lt;br /&gt;
 --44770&lt;br /&gt;
By looking at the counts you can see that there is still 1 duplicate key in the table. Let's find it another way. Running the query below finds the key (coname, statecode, datefirstinv) that appears twice in the table. &lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)AS key&lt;br /&gt;
 GROUP BY coname, statecode, datefirstinv&lt;br /&gt;
 HAVING COUNT(key) &amp;gt; 1;&lt;br /&gt;
The output looks like this:&lt;br /&gt;
           coname            | statecode | datefirstinv&lt;br /&gt;
 ----------------------------+-----------+--------------&lt;br /&gt;
 New York Digital Health LLC | NY        | 2015-08-13&lt;br /&gt;
We'll have to copy companybase1 out of the db and look in TextPad for something unique about one of the New York Digital Health LLC entries that we can use to delete it manually. It turns out the url is different, so we'll use that.&lt;br /&gt;
Delete this record from the roundbase table using the command below. Now we're ready to build the companybasecore table.&lt;br /&gt;
 DELETE FROM roundbase WHERE coname = 'New York Digital Health LLC' AND statecode = 'NY' AND datefirstinv = to_date('2015-08-13', 'YYYY-MM-DD') AND url = 'www.digitalhealthaccelerator.c';&lt;br /&gt;
&lt;br /&gt;
==companybasecore table==&lt;br /&gt;
The queries below build your companybasecore table. The where clause takes the place of the 2 flags on nationcode and undisclosed company we built in companybase1 table. This table has a guaranteed 1:1 relationship between coname, statecode, datefirstinv and an entry in the table. The two queries at the end verify this. We use core tables to run joins later on. &lt;br /&gt;
 DROP TABLE companybasecore;&lt;br /&gt;
 CREATE TABLE companybasecore AS&lt;br /&gt;
 SELECT DISTINCT &lt;br /&gt;
 coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip&lt;br /&gt;
 FROM roundbase WHERE nationcode = 'US' AND coname != 'Undisclosed Company';&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 --recheck keys&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
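The two count queries used for each key check can also be folded into one statement that returns both numbers side by side. This is a sketch on a made-up table, keydemo, with invented rows; swap in the real table and key columns.&lt;br /&gt;
 CREATE TEMP TABLE keydemo (coname varchar(255), statecode varchar(2), datefirstinv date);&lt;br /&gt;
 INSERT INTO keydemo VALUES&lt;br /&gt;
  ('Acme Inc', 'TX', '2010-01-15'),&lt;br /&gt;
  ('Acme Inc', 'TX', '2010-01-15'),&lt;br /&gt;
  ('Beta LLC', 'CA', '2012-07-01');&lt;br /&gt;
 --the key is valid only when totalrows equals distinctkeys (here 3 and 2, so it is not)&lt;br /&gt;
 SELECT (SELECT COUNT(*) FROM keydemo) AS totalrows,&lt;br /&gt;
  (SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM keydemo)a) AS distinctkeys;&lt;br /&gt;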
&lt;br /&gt;
==Cleaning ipos table==&lt;br /&gt;
Check to see if the existing keys in the table are valid. We are using issuer, issuedate, statecode as the key.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --10440&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipos)a;&lt;br /&gt;
 --9491&lt;br /&gt;
The keys are not unique so we must remove duplicate keys first. You will need to try different methods to isolate the duplicate keys. This is where you can be creative. I first started by finding the duplicates based on issuer, issuedate and statecode which is our key. Have a look in textpad/excel for ways to filter these keys. We would like to save as much information as possible so rather than excluding all these entries which sum to 1888 rows in the ipos table maybe there's some other way we can filter out records and still have distinct keys.&lt;br /&gt;
 DROP TABLE ipoduplicates;&lt;br /&gt;
 CREATE TABLE ipoduplicates AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipos)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --939&lt;br /&gt;
 \COPY ipoduplicates TO 'ipoduplicates.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;&lt;br /&gt;
In the file you will notice that many keys contain different principalamts. Let's keep the MAX principal amount and throw out the same key that has a lower principalamt. This query is shown below.&lt;br /&gt;
 DROP TABLE ipoinclude;&lt;br /&gt;
 CREATE TABLE ipoinclude AS&lt;br /&gt;
 SELECT issuer, issuedate, statecode, MAX(principalamt) AS principalamt&lt;br /&gt;
 FROM ipos&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode;&lt;br /&gt;
 --9470&lt;br /&gt;
Now use the ipoinclude table to create a ipocore table. Then check to see if this core table has unique keys so 1 key matches with 1 record. This is the defining characteristic of a core table.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.issuer, ipos.issuedate, ipos.statecode&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
You should notice that the ipocore table count does not match the count of DISTINCT keys. This means there are still some duplicates. So I created another duplicate table.&lt;br /&gt;
 DROP TABLE ipoduplicates2;&lt;br /&gt;
 CREATE TABLE ipoduplicates2 AS&lt;br /&gt;
 SELECT *, COUNT(*)&lt;br /&gt;
 FROM (SELECT issuer, issuedate, statecode FROM ipocore)a&lt;br /&gt;
 GROUP BY issuer, issuedate, statecode&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
Then I created DELETE statements for all these entries. I deleted them from the ipoinclude table which will prevent these keys from appearing in the ipocore table when you JOIN the ipos with ipoinclude table.&lt;br /&gt;
 --manually remove bad keys&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'PacTel Corp' AND statecode = 'CA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Dragon Fund Inc' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sterling Commerce' AND statecode = 'TX';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Sothebys Holdings Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'TD Waterhouse Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Berlitz International Inc' AND statecode = 'NJ';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Spain Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Ultramar Corp' AND statecode = 'CT';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Goldman Sachs Group Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Fidelity Advisor Korea Fund' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Euronet Services Inc' AND statecode = 'KS';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Emerging Markets Tele Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'FirstMiss Gold Inc' AND statecode = 'NV';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Templeton Vietnam Opportunitie' AND statecode = 'FL';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Hybridon Inc' AND statecode = 'MA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Indonesia Fund Inc' AND statecode = 'NY';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'OpenTV Corp' AND statecode = 'CA';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Scudder New Europe Fund' AND statecode = 'NY';&lt;br /&gt;
 --2&lt;br /&gt;
 DELETE FROM ipoinclude WHERE issuer = 'Austria Fund Inc' AND statecode = 'NY';  &lt;br /&gt;
 --2&lt;br /&gt;
Now again JOIN your ipos table with your ipoinclude table and check the key count.&lt;br /&gt;
 DROP TABLE ipocore;&lt;br /&gt;
 CREATE TABLE ipocore AS&lt;br /&gt;
 SELECT ipos.*&lt;br /&gt;
 FROM ipos INNER JOIN ipoinclude ON ipos.issuer = ipoinclude.issuer AND ipos.issuedate = ipoinclude.issuedate AND &lt;br /&gt;
 ipos.statecode = ipoinclude.statecode AND ipos.principalamt = ipoinclude.principalamt;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT issuer, issuedate, statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
The counts line up so now you should have a clean ipocore table!&lt;br /&gt;
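An alternative to the manual DELETE statements, offered as a sketch rather than the method used to build vcdb2, is PostgreSQL's DISTINCT ON. It keeps exactly one row per key, here the one with the largest principalamt, and picks arbitrarily among ties. Unlike the DELETE statements above, it keeps one row for a tied key instead of dropping the key altogether. The ipodemo table, its column types and its rows are made up.&lt;br /&gt;
 CREATE TEMP TABLE ipodemo (issuer varchar(100), issuedate date, statecode varchar(2), principalamt numeric);&lt;br /&gt;
 INSERT INTO ipodemo VALUES&lt;br /&gt;
  ('Acme Inc', '1999-05-04', 'TX', 50.0),&lt;br /&gt;
  ('Acme Inc', '1999-05-04', 'TX', 75.0),&lt;br /&gt;
  ('Beta LLC', '2001-11-20', 'CA', 20.0);&lt;br /&gt;
 --one row per (issuer, issuedate, statecode); the ORDER BY decides which row survives&lt;br /&gt;
 SELECT DISTINCT ON (issuer, issuedate, statecode) *&lt;br /&gt;
 FROM ipodemo&lt;br /&gt;
 ORDER BY issuer, issuedate, statecode, principalamt DESC NULLS LAST;&lt;br /&gt;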
&lt;br /&gt;
==Cleaning mas table==&lt;br /&gt;
Check to see if you have bad keys in the table. The row count of the table should match up with count of distinct keys based on targetname, targetstatecode, announceddate. &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 --114890 &lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mas)a;&lt;br /&gt;
 --114825&lt;br /&gt;
The counts don't match, so we'll have to clean the mas table. There is no obvious field to filter on in mas, so I added an id column and kept the MIN id for each duplicated key.&lt;br /&gt;
 CREATE TABLE mas1 AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM mas;&lt;br /&gt;
 ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
 ALTER TABLE mas ADD COLUMN id SERIAL PRIMARY KEY;&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE masinclude;&lt;br /&gt;
 CREATE TABLE masinclude AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate, MIN(id) as id&lt;br /&gt;
 FROM mas1&lt;br /&gt;
 GROUP BY targetname, targetstatecode, announceddate;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE mascore;&lt;br /&gt;
 CREATE TABLE mascore AS&lt;br /&gt;
 --join on mas1, the table whose ids masinclude was built from&lt;br /&gt;
 SELECT mas1.*&lt;br /&gt;
 FROM mas1 INNER JOIN masinclude ON mas1.id = masinclude.id;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM (SELECT DISTINCT targetname, targetstatecode, announceddate FROM mascore)a;&lt;br /&gt;
The distinct key count matches the total row count of the table, so the mascore table is clean.&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching companybase keys to mas keys==&lt;br /&gt;
Before attempting to match companybasecore with mascore you need clean tables, or you will get many errors in the matcher output file. The core tables should already contain distinct keys if you've followed the process, but running the matcher will still yield many errors, so we filter the mas keys further. The first step is to remove mas keys (targetname, targetstatecode, announceddate) whose announceddate falls within a week after the earliest announceddate for the same targetname and targetstatecode. Keep the key with the minimum announceddate and discard the later ones, as shown below:&lt;br /&gt;
 DROP TABLE maskeys;&lt;br /&gt;
 CREATE TABLE maskeys AS&lt;br /&gt;
 SELECT DISTINCT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM mascore;&lt;br /&gt;
 --114825&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysmindates;&lt;br /&gt;
 CREATE TABLE maskeysmindates AS&lt;br /&gt;
 SELECT targetname, targetstatecode, MIN(announceddate) AS announceddate&lt;br /&gt;
 FROM mascore&lt;br /&gt;
 GROUP BY targetname, targetstatecode;&lt;br /&gt;
 --113236&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE maskeysdatewindow;&lt;br /&gt;
 CREATE TABLE maskeysdatewindow AS&lt;br /&gt;
 SELECT maskeys.*, maskeysmindates.announceddate as minanndate,&lt;br /&gt;
 CASE WHEN maskeys.announceddate - INTERVAL '7 day' &amp;gt; maskeysmindates.announceddate OR maskeys.announceddate = &lt;br /&gt;
 maskeysmindates.announceddate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM maskeys LEFT JOIN maskeysmindates ON (maskeys.targetname = maskeysmindates.targetname AND &lt;br /&gt;
 maskeys.targetstatecode = maskeysmindates.targetstatecode);&lt;br /&gt;
 --114825&lt;br /&gt;
The dateflag is 1 when the current key's announceddate is more than 1 week after the minimum announceddate for its targetname, targetstatecode pair, or when it is that minimum announceddate. If the announceddate is later than the minimum by 7 days or less, the dateflag is 0.    &lt;br /&gt;
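To make the rule concrete, the same CASE expression can be run on a few made-up dates that share a minimum announceddate of 2010-03-01. The dates are invented for illustration.&lt;br /&gt;
 SELECT announceddate,&lt;br /&gt;
 CASE WHEN announceddate - INTERVAL '7 day' &amp;gt; minanndate OR announceddate = minanndate THEN 1::int&lt;br /&gt;
 ELSE 0::int&lt;br /&gt;
 END AS dateflag&lt;br /&gt;
 FROM (VALUES&lt;br /&gt;
  (DATE '2010-03-01', DATE '2010-03-01'),&lt;br /&gt;
  (DATE '2010-03-05', DATE '2010-03-01'),&lt;br /&gt;
  (DATE '2010-03-08', DATE '2010-03-01'),&lt;br /&gt;
  (DATE '2010-03-20', DATE '2010-03-01')&lt;br /&gt;
 ) AS d (announceddate, minanndate)&lt;br /&gt;
 ORDER BY announceddate;&lt;br /&gt;
 --dateflag is 1, 0, 0, 1: the minimum itself and the date more than a week later are kept&lt;br /&gt;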
&lt;br /&gt;
 CREATE TABLE maskeysdatefiltered AS&lt;br /&gt;
 SELECT targetname, targetstatecode, announceddate&lt;br /&gt;
 FROM maskeysdatewindow&lt;br /&gt;
 WHERE dateflag = 1; &lt;br /&gt;
 --114794&lt;br /&gt;
 \COPY maskeysdatefiltered TO 'maskeysdatefiltered.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Grab the portco keys from the companybasecore table:&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
Put the portcokeys and maskeysdatefiltered text files into the Matcher Input folder. For more instructions on how to run the Matcher see [[The Matcher (Tool)]]&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and mas==&lt;br /&gt;
You will still receive multiple warnings in the output.matched file. In Excel add flags to exclude if the announceddate &amp;lt; datefirstinv and another exclude flag if the datefirstinv = announceddate. Also add a warning flag if the Warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Then import this back into your db by creating a matcheroutput table.&lt;br /&gt;
 DROP TABLE matcherportcomas;&lt;br /&gt;
 CREATE TABLE matcherportcomas (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2targetname varchar(100),&lt;br /&gt;
  file2targetstatecode varchar (2),&lt;br /&gt;
  file2announceddate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcomas FROM 'matcheroutputportco-mas.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --9645&lt;br /&gt;
You've imported 9,645 matches into your matcher table in vcdb2, but the query below counts only the &amp;quot;good&amp;quot; matches. These are matches that carry no warning and where the merger/acquisition announceddate is strictly later than the datefirstinv.&lt;br /&gt;
 SELECT COUNT(*) FROM &lt;br /&gt;
 (SELECT file1coname, file1statecode, file2targetname, file2targetstatecode FROM matcherportcomas WHERE excludeflag1 = 0 AND &lt;br /&gt;
 excludeflag2 = 0 AND warningflag = 0)a;  &lt;br /&gt;
 --8291&lt;br /&gt;
As you can see we're throwing out a lot of the data in the matcher file (9645 -&amp;gt; 8291). The next few queries try to save as many of the flagged matches as possible and add them back to the good matches to create our matcherportcomascore table.&lt;br /&gt;
 &lt;br /&gt;
Select the portco keys that are matched to the minimum announceddate for any mergers:&lt;br /&gt;
 DROP TABLE matcherwarningmindates;&lt;br /&gt;
 CREATE TABLE matcherwarningmindates AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2announceddate) &lt;br /&gt;
 FROM matcherportcomas &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --364&lt;br /&gt;
Then using the temporary key (file1coname, file1statecode, file1datefirstinv, file2announceddate) join this back to the original matcher table to get the rest of the data we will want in the core table. &lt;br /&gt;
 DROP TABLE matcherportcomasinclude;&lt;br /&gt;
 CREATE TABLE matcherportcomasinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcomas AS m INNER JOIN matcherwarningmindates AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = &lt;br /&gt;
 mi.file1statecode AND m.file1datefirstinv = mi.file1datefirstinv AND m.file2announceddate = mi.min  &lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --366&lt;br /&gt;
The inner join should return the same number of rows as the matcherwarningmindates table (364), but it returns 366. To find the dirty entries we'll use the query below.&lt;br /&gt;
 SELECT *, COUNT(*) FROM&lt;br /&gt;
 (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
&lt;br /&gt;
        file1coname       | file1statecode | file1datefirstinv | count&lt;br /&gt;
 -------------------------+----------------+-------------------+-------&lt;br /&gt;
  PA Inc                  | TX             | 2007-09-25        |     2&lt;br /&gt;
  High Sierra Energy L.P. | CO             | 2004-12-23        |     2&lt;br /&gt;
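To inspect the duplicated keys without leaving psql, the full rows can be pulled with a query along these lines. This is only a sketch that reuses the grouping above; it does not decide which row to keep.&lt;br /&gt;
 --show every column for keys that appear more than once in matcherportcomasinclude&lt;br /&gt;
 SELECT m.* FROM matcherportcomasinclude AS m&lt;br /&gt;
 INNER JOIN (SELECT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv HAVING COUNT(*) &amp;gt; 1) AS d&lt;br /&gt;
 ON m.file1coname = d.file1coname AND m.file1statecode = d.file1statecode AND m.file1datefirstinv = d.file1datefirstinv&lt;br /&gt;
 ORDER BY m.file1coname, m.file2targetname;&lt;br /&gt;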
Find these records in the matcherportcomas table in Excel and delete 1 entry from each manually:&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'PA Inc' AND file1statecode = 'TX' AND file2targetname = 'PA Corp' &lt;br /&gt;
 AND file2targetstatecode = 'VA';&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM matcherportcomasinclude WHERE file1coname = 'High Sierra Energy L.P.' AND file1statecode = 'CO' AND &lt;br /&gt;
 file2targetname = 'High Sierra Energy GP LLC' AND file2targetstatecode = 'CO';&lt;br /&gt;
 --1&lt;br /&gt;
Now we should have a clean matcherportcomasinclude table. To be sure check the number of distinct matches using the query below. It should be the same as the number of records in this table.&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcomasinclude)a;&lt;br /&gt;
 --364&lt;br /&gt;
 SELECT COUNT(*) FROM matcherportcomasinclude;&lt;br /&gt;
 --364&lt;br /&gt;
Looks good so let's UNION ALL to join the matcherportcomasinclude table with the matcherportcomas with all flags set to 0 to create the core table.&lt;br /&gt;
 CREATE TABLE matcherportcomascore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomas  WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcomasinclude;&lt;br /&gt;
 --8655&lt;br /&gt;
Recheck the key counts. Each portco key from the companybase table should match exactly 1 mas key from the mascore table. If any portco key matches more than one mas key, you will get duplicate rows in the next phase when you join the companybase table to the mas table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
 SELECT DISTINCT file1coname, file1statecode, file1datefirstinv&lt;br /&gt;
 FROM matcherportcomascore) AS foo;&lt;br /&gt;
 --8655&lt;br /&gt;
Great! Now you are ready to begin joining the companybase table to the mas table.&lt;br /&gt;
==Joining companybasekeys with maskeys and ipokeys==&lt;br /&gt;
Before doing this stage make sure the following is true:&lt;br /&gt;
#companybasecore, mascore, ipocore are clean core tables...They should be 1:1 on themselves. That means 1 key should match to one row in each respective table. See [[#Cleaning the Companybase table|Cleaning the Companybase table]] [[#Cleaning mas table|Cleaning mas table]] [[#Cleaning ipos table|Cleaning ipos table]] for instructions&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and mascore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to mas keys|Name Based Matching companybase keys to mas keys]] and [[#Fixing Errors in the Matcher Output for portco and mas|Fixing Errors in the Matcher Output for portco and mas]]&lt;br /&gt;
#You've done name based matching on the keys in companybasecore and ipocore and cleaned up the matcher output file. See [[#Name Based Matching companybase keys to ipo keys|Name Based Matching companybase keys to ipo keys]] and [[#Fixing Errors in the Matcher Output for portco and ipo|Fixing Errors in the Matcher Output for portco and ipo]] &lt;br /&gt;
&lt;br /&gt;
We want to join the three sets of keys together before grabbing other data from their respective tables because there will be collisions between the maskeys and ipokeys. Some companies will have ipos as well as mergers/acquisitions, or the data might be miscoded by SDC Platinum. The problem for us is that a company with both an ipo and an ma will cause our join row counts to increase every time we join with these duplicate keys. We want a portcokey to join with only one ipokey or maskey in our master table. Running the query below creates a table that contains the three sets of keys:&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeysaddipokeys AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, matcherm.file2targetname AS mastargetname, matcherm.file2targetstatecode AS masstatecode, &lt;br /&gt;
 matcherm.file2announceddate AS announceddate, matcheri.file2issuer AS ipoissuer, matcheri.file2statecode AS ipostatecode, &lt;br /&gt;
 matcheri.file2issuedate AS ipoissuedate  FROM&lt;br /&gt;
 companybasecore AS c LEFT JOIN matcherportcomascore as matcherm ON c.coname = matcherm.file1coname AND c.statecode = &lt;br /&gt;
 matcherm.file1statecode AND c.datefirstinv = matcherm.file1datefirstinv&lt;br /&gt;
 LEFT JOIN matcherportcoipocore AS matcheri ON c.coname = matcheri.file1coname AND c.statecode = matcheri.file1statecode AND &lt;br /&gt;
 c.datefirstinv = matcheri.file1datefirstinv; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeysaddipokeys TO 'companybasekeysaddmaskeysaddipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
Open in Excel and add a flag to see the rows with an ipokey as well as a maskey. You can use a formula like this: =IF(OR(ISBLANK(G518),ISBLANK(D518)),0,1). You'll see there are 83 portcokeys that match to an ipokey and a maskey. We'll write a query in sql to take the ipokey or maskey with the lowest date attached to it. This will be the exit date for that portco.&lt;br /&gt;
First we create a table that has the minimum exit date. Then we add flags to indicate when the ipokey is valid and when the maskey is valid. Then we create the companybasekeymaskeycore and companybasekeyipokeycore tables, which contain clean matches from companybasekey (portcokey) to mas or ipo.&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindate; &lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindate AS&lt;br /&gt;
 SELECT *, &lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate IS NOT NULL AND ipoissuedate IS NOT NULL THEN LEAST(announceddate,ipoissuedate)&lt;br /&gt;
  WHEN announceddate IS NOT NULL THEN announceddate&lt;br /&gt;
  WHEN ipoissuedate IS NOT NULL THEN ipoissuedate&lt;br /&gt;
  END AS masterdate&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasekeysaddmaskeyaddipokeysmindate TO 'companybasekeysaddmaskeyaddipokeysmindate.txt' WITH DELIMITER AS E'\t' HEADER NULL &lt;br /&gt;
 AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeysaddmaskeyaddipokeysmindateflag;&lt;br /&gt;
 CREATE TABLE companybasekeysaddmaskeyaddipokeysmindateflag AS&lt;br /&gt;
 SELECT keys.*,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN announceddate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS maskeyvalid,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN ipoissuedate = masterdate THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS ipokeyvalid&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindate as keys;&lt;br /&gt;
 --44740&lt;br /&gt;
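One edge case is worth checking here. If a portco's announceddate and ipoissuedate fall on the same day, both dates equal masterdate and both flags are set to 1, so the portco would land in both key tables. A quick check, sketched under the assumption that any such ties should be resolved by hand:&lt;br /&gt;
 --rows where the mas and ipo exit dates tie; ideally this returns 0&lt;br /&gt;
 SELECT COUNT(*) FROM companybasekeysaddmaskeyaddipokeysmindateflag&lt;br /&gt;
 WHERE maskeyvalid = 1 AND ipokeyvalid = 1;&lt;br /&gt;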
&lt;br /&gt;
Now create the companybasekeymaskeycore and companybasekeyipokeycore tables using the flags created above.&lt;br /&gt;
 DROP TABLE companybasekeymaskeycore;&lt;br /&gt;
 CREATE TABLE companybasekeymaskeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.mastargetname, c.masstatecode, c.announceddate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE maskeyvalid = 1;&lt;br /&gt;
 --8610 &lt;br /&gt;
&lt;br /&gt;
 DROP TABLE companybasekeyipokeycore;&lt;br /&gt;
 CREATE TABLE companybasekeyipokeycore AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.ipoissuer, c.ipostatecode, c.ipoissuedate&lt;br /&gt;
 FROM companybasekeysaddmaskeyaddipokeysmindateflag AS c&lt;br /&gt;
 WHERE ipokeyvalid = 1;&lt;br /&gt;
 --2312&lt;br /&gt;
&lt;br /&gt;
To check that you have the correct number of ipo and mas keys, add the two counts from the queries above to the count from the query below and compare the total to the combined number of rows in your matcherportcoipocore and matcherportcomascore tables. In my case I get 8610 + 2312 + 83 = 2350 + 8655.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM companybasekeysaddmaskeysaddipokeys&lt;br /&gt;
 WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL;&lt;br /&gt;
 --83&lt;br /&gt;
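The same comparison can be done in a single query instead of adding the counts by hand. This is a sketch using only the tables built above; the column aliases keytotal and matchertotal are made up for illustration. The two columns should be equal (with the counts shown on this page, 8610 + 2312 + 83 = 11005 and 8655 + 2350 = 11005).&lt;br /&gt;
 SELECT&lt;br /&gt;
  (SELECT COUNT(*) FROM companybasekeymaskeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeyipokeycore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM companybasekeysaddmaskeysaddipokeys WHERE ipoissuedate IS NOT NULL AND announceddate IS NOT NULL) AS keytotal,&lt;br /&gt;
  (SELECT COUNT(*) FROM matcherportcomascore)&lt;br /&gt;
  + (SELECT COUNT(*) FROM matcherportcoipocore) AS matchertotal;&lt;br /&gt;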
Now you can successfully join the companybasecore table to the ipocore and mascore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. With this step done you can create a master table which will contain information from companybase and ipo and mas.&lt;br /&gt;
==Creating companybaseipomasmaster table==&lt;br /&gt;
Before doing this stage make sure you have followed the steps in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]]&lt;br /&gt;
You will be joining the companybasecore table with the mascore and ipocore tables through the companybasekeyipokeycore and companybasekeymaskeycore tables. The output master table will have each company name and, if the company had an ipo or ma, the dates and amounts. As discussed in [[#Joining companybasekeys with maskeys and ipokeys|Joining companybasekeys with maskeys and ipokeys]], the master table includes only the exit deal with the minimum date, so duplicate rows should not crop up in the master table.&lt;br /&gt;
 DROP TABLE companybaseipomasmaster;&lt;br /&gt;
 CREATE TABLE companybaseipomasmaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, ipokey.ipoissuedate, maskey.announceddate AS &lt;br /&gt;
 masannounceddate, i.principalamt AS ipoprincipalamtk, m.transactionamt AS mastransactionamtk&lt;br /&gt;
 FROM companybasecore AS c  &lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybaseipomasmaster TO 'companybaseipomasmaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
You can run checks on the ipo and mas counts to make sure everything joined properly. Any duplicate keys that were not cleaned up in previous steps will make this master table a complete mess due to all the joins so make sure you've followed the process fully. Below are some of the checks I ran:&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE masannounceddate IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE mastransactionamtk IS NOT NULL; &lt;br /&gt;
 --8610&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoissuedate IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster WHERE ipoprincipalamtk IS NOT NULL; &lt;br /&gt;
 --2312&lt;br /&gt;
Everything looks good. These counts are compared against the key tables and core tables built in the previous steps.&lt;br /&gt;
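A further check, not part of the checks listed above, is to confirm that the joins did not duplicate any company. A minimal sketch: the number of distinct portco keys in the master table should equal its row count (44740 on this page).&lt;br /&gt;
 --distinct portco keys in the master table&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybaseipomasmaster)a;&lt;br /&gt;
 --total rows in the master table; should match the count above&lt;br /&gt;
 SELECT COUNT(*) FROM companybaseipomasmaster;&lt;br /&gt;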
&lt;br /&gt;
==Name Based Matching companybase keys to ipo keys==&lt;br /&gt;
First verify that your keys in companybasecore and ipocore are unique by using the following queries. If they are not, follow the instructions in these sections: [[#Cleaning the Companybase table|Cleaning the Companybase table]] and [[#Cleaning ipos table|Cleaning ipos table]].&lt;br /&gt;
 SELECT COUNT(*) FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore)a;&lt;br /&gt;
 --44740&lt;br /&gt;
 SELECT COUNT(*) FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT issuer, issuedate,statecode FROM ipocore)a;&lt;br /&gt;
 --9470&lt;br /&gt;
Next export the keys to text files and put them in the Matcher input folder. Run the Matcher on these files. For instructions on how to use the Matcher, see [[The Matcher (Tool)]].&lt;br /&gt;
 DROP TABLE portcokeys;&lt;br /&gt;
 CREATE TABLE portcokeys AS&lt;br /&gt;
 SELECT DISTINCT coname, statecode, datefirstinv&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeys TO 'portcokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE ipokeys;&lt;br /&gt;
 CREATE TABLE ipokeys AS&lt;br /&gt;
 SELECT issuer, statecode, issuedate&lt;br /&gt;
 FROM ipocore;&lt;br /&gt;
 --9470&lt;br /&gt;
 \COPY ipokeys TO 'ipokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Fixing Errors in the Matcher Output for portco and ipo==&lt;br /&gt;
After running the Matcher on your portcokeys and ipokeys you will notice there are some errors in the matched output file. Add flags in Excel that exclude rows where the issuedate &amp;lt; datefirstinv and where the issuedate = datefirstinv. If either exclude flag is 1, then you want to exclude the row from your table. If both flags are 0, i.e. the issuedate &amp;gt; datefirstinv, then you want to keep the row. Also add a column for a warning flag that is 1 if the warning column is &amp;quot;Hall-Warning:Multiple&amp;quot;. Next copy this txt file into the db by creating a new table.&lt;br /&gt;
 DROP TABLE matcherportcoipo;&lt;br /&gt;
 CREATE TABLE matcherportcoipo (&lt;br /&gt;
  warning varchar(100),&lt;br /&gt;
  file1coname varchar(100),&lt;br /&gt;
  file1statecode varchar(2),&lt;br /&gt;
  file1datefirstinv date,&lt;br /&gt;
  file2issuer varchar(100),&lt;br /&gt;
  file2statecode varchar (2),&lt;br /&gt;
  file2issuedate date,&lt;br /&gt;
  excludeflag1 int,&lt;br /&gt;
  excludeflag2 int,&lt;br /&gt;
  warningflag int&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcoipo FROM 'matcherportco-ipos.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --2592&lt;br /&gt;
You can see the &amp;quot;good&amp;quot; matches by setting all the flags to 0 as shown in the query below.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0;&lt;br /&gt;
 --2313&lt;br /&gt;
We would like to add back all the data we can so let's have a look at the rows with multiple matches.&lt;br /&gt;
 SELECT COUNT(*)&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1;&lt;br /&gt;
 --66&lt;br /&gt;
Many of the duplicates have different issuedates so we'll just select the minimum issuedate for entries where the portcokey is matched more than once.&lt;br /&gt;
 DROP TABLE matcherportcoipomindate;&lt;br /&gt;
 CREATE TABLE matcherportcoipomindate AS&lt;br /&gt;
 SELECT file1coname, file1statecode, file1datefirstinv, MIN(file2issuedate)&lt;br /&gt;
 FROM matcherportcoipo&lt;br /&gt;
 WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 1&lt;br /&gt;
 GROUP BY file1coname, file1statecode, file1datefirstinv;&lt;br /&gt;
 --37&lt;br /&gt;
Then we can create an include table and union this with the good matches to create a matcher core file for portco and ipos.&lt;br /&gt;
 CREATE TABLE matcherportcoipoinclude AS&lt;br /&gt;
 SELECT m.* FROM&lt;br /&gt;
 matcherportcoipo AS m JOIN matcherportcoipomindate AS mi ON m.file1coname = mi.file1coname AND m.file1statecode = mi.file1statecode AND &lt;br /&gt;
 m.file1datefirstinv = mi.file1datefirstinv AND m.file2issuedate = mi.min; &lt;br /&gt;
 --37&lt;br /&gt;
And create a matcherportcoipocore table by combining the good matches with the fixed mismatches.&lt;br /&gt;
 CREATE TABLE matcherportcoipocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipo WHERE excludeflag1 = 0 AND excludeflag2 = 0 AND warningflag = 0&lt;br /&gt;
 UNION ALL&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM matcherportcoipoinclude;&lt;br /&gt;
 --2350&lt;br /&gt;
Now verify that the key counts are correct. The number of distinct portco keys should equal the number of rows in the core table.&lt;br /&gt;
 SELECT COUNT(*) FROM (&lt;br /&gt;
  SELECT DISTINCT file1coname, file1statecode, file1datefirstinv FROM matcherportcoipocore)a;&lt;br /&gt;
 --2350&lt;br /&gt;
Boom the matcherportcoipocore table is clean and good for use.&lt;br /&gt;
&lt;br /&gt;
==Cleaning geo table==&lt;br /&gt;
The geo table contains duplicate keys. The key for the geo table is (coname, city, startyear). Look at the different counts for all keys and distinct keys from the table:&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geo)a; &lt;br /&gt;
 --43651&lt;br /&gt;
 SELECT COUNT(*) FROM geo;&lt;br /&gt;
 --43724&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
If you look at the rows with duplicate keys you can see they are simply complete duplicates so let's create a table with just distinct rows.&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo; &lt;br /&gt;
 --43662 &lt;br /&gt;
We still have 11 extra rows, spread over 8 keys that are not distinct. We'll need to clean those up.&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --8&lt;br /&gt;
     city      |           coname            | startyear | count&lt;br /&gt;
 --------------+-----------------------------+-----------+-------&lt;br /&gt;
 New York      | New York Digital Health LLC |      2015 |     2&lt;br /&gt;
 Portland      | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 Hauppauge     | Mdeverywhere Inc            |      1999 |     2&lt;br /&gt;
 North Mankato | Angie's Artisan Treats LLC  |      2011 |     2&lt;br /&gt;
 Cincinnati    | Undisclosed Company         |      2016 |     4&lt;br /&gt;
 New York      | Undisclosed Company         |      2015 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2016 |     2&lt;br /&gt;
 San Francisco | Undisclosed Company         |      2015 |     3&lt;br /&gt;
Modify geo1 table query to get rid of Undisclosed Companies:&lt;br /&gt;
 DROP TABLE geo1;&lt;br /&gt;
 CREATE TABLE geo1 AS&lt;br /&gt;
 SELECT DISTINCT * &lt;br /&gt;
 FROM geo&lt;br /&gt;
 WHERE coname NOT LIKE '%Undisc%';&lt;br /&gt;
 --43631&lt;br /&gt;
 SELECT *, COUNT(*) &lt;br /&gt;
 FROM (SELECT city, coname, startyear FROM geo1)a&lt;br /&gt;
 GROUP BY city, coname, startyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;
 --3&lt;br /&gt;
Now manually check the longitude and latitude of each of these rows and delete one of each of them. Then create your core table and verify that all the keys are distinct.&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'New York Digital Health LLC' AND city = 'New York' AND startyear = 2015 AND lattitude = 44.933143::real AND longitude = 7.540121::real;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE coname = 'Mdeverywhere Inc' AND city = 'Hauppauge' AND endyear = 2011;&lt;br /&gt;
 --1&lt;br /&gt;
 DELETE FROM geo1 WHERE city = 'North Mankato' AND lattitude = 44.19030721::real AND longitude = -94.052706::real;&lt;br /&gt;
 --1&lt;br /&gt;
 CREATE TABLE geocore AS&lt;br /&gt;
 SELECT *&lt;br /&gt;
 FROM geo1;&lt;br /&gt;
 --43628&lt;br /&gt;
 SELECT COUNT(*) FROM (SELECT DISTINCT city, coname, startyear FROM geocore)a; &lt;br /&gt;
 --43628&lt;br /&gt;
&lt;br /&gt;
==Name Based Matching geo keys to companybase keys==&lt;br /&gt;
Get a list of geokeys and companybasekeys and run them through [[The Matcher (Tool)]]. The key is (coname, city, startyear) so you'll need to extract the year from the datefirstinv in the companybasecore table. See below.&lt;br /&gt;
 DROP TABLE geokeys;&lt;br /&gt;
 CREATE TABLE geokeys AS&lt;br /&gt;
 SELECT coname, city, startyear&lt;br /&gt;
 FROM geocore&lt;br /&gt;
 WHERE noaddress = 0::boolean;&lt;br /&gt;
 --33628&lt;br /&gt;
 \COPY geokeys TO 'geokeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV &lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE portcokeysforgeo AS&lt;br /&gt;
 SELECT coname, city, EXTRACT(YEAR FROM datefirstinv) AS startyear&lt;br /&gt;
 FROM companybasecore;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY portcokeysforgeo TO 'portcokeysforgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
After you run the matcher you will notice there are a ton of matching errors as usual. If you simply import this into your vcdb2 and try joining companybasecore with geocore your tables will start to explode. Notice how the line count jumps from 44,740 to 45,018.&lt;br /&gt;
 DROP TABLE matcherportcogeo;&lt;br /&gt;
 CREATE TABLE matcherportcogeo (&lt;br /&gt;
  portcoconame varchar(255),&lt;br /&gt;
  portcocity varchar(100),&lt;br /&gt;
  portcostartyear integer,&lt;br /&gt;
  geoconame varchar(255),&lt;br /&gt;
  geocity varchar(100),&lt;br /&gt;
  geodatefirstyear integer&lt;br /&gt;
 );&lt;br /&gt;
 \COPY matcherportcogeo FROM 'matcheroutputportcogeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --33608    &lt;br /&gt;
 &lt;br /&gt;
 --try matching companybase to geo through the matcherportcogeo&lt;br /&gt;
 CREATE TABLE companybasecorejoingeo AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.lattitude, g.longitude &lt;br /&gt;
 FROM companybasecore c&lt;br /&gt;
 LEFT JOIN matcherportcogeo AS m ON m.portcoconame = c.coname AND m.portcocity = c.city AND m.portcostartyear = EXTRACT(YEAR FROM &lt;br /&gt;
 c.datefirstinv)&lt;br /&gt;
 LEFT JOIN geocore AS g ON g.coname = m.geoconame AND m.geocity = g.city AND m.geodatefirstyear = g.startyear; &lt;br /&gt;
 --45018&lt;br /&gt;
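To see where the extra 278 rows (45018 - 44740) come from, one place to look is portco keys that appear more than once in the matcher output. The query below is a diagnostic sketch and covers only that one possible cause; companies in companybasecore that share the same (coname, city, year) would also inflate the join.&lt;br /&gt;
 --portco keys matched to more than one geo key&lt;br /&gt;
 SELECT portcoconame, portcocity, portcostartyear, COUNT(*)&lt;br /&gt;
 FROM matcherportcogeo&lt;br /&gt;
 GROUP BY portcoconame, portcocity, portcostartyear&lt;br /&gt;
 HAVING COUNT(*) &amp;gt; 1;&lt;br /&gt;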
Okay so we need to fix this. Luckily I already had this data in another database so I copied it out and imported it into vcdb2. The raw data can be found in a text file in the folder on the Z drive called geolookupold.txt.&lt;br /&gt;
 CREATE TABLE geoimport (&lt;br /&gt;
  coname varchar(100),&lt;br /&gt;
  statecode varchar(2),&lt;br /&gt;
  datefirstinv date,&lt;br /&gt;
  latitude real,&lt;br /&gt;
  longitude real&lt;br /&gt;
 );&lt;br /&gt;
 \COPY geoimport FROM 'geolookupold.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
 --42678  &lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM&lt;br /&gt;
 (SELECT DISTINCT coname, statecode, datefirstinv FROM geoimport)a;&lt;br /&gt;
 --42678&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE companybasegeomaster AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.investedk, c.city, c.addr1, c.addr2, g.latitude, g.longitude&lt;br /&gt;
 FROM companybasecore AS c &lt;br /&gt;
 LEFT JOIN geoimport AS g ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirstinv = g.datefirstinv;&lt;br /&gt;
 --44740&lt;br /&gt;
 \COPY companybasegeomaster TO 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV&lt;br /&gt;
&lt;br /&gt;
==Gathering geo data from company addresses==&lt;br /&gt;
If you do not already have a file with all the geo data in it, you can look up the latitude and longitude from Google using the company address. See [[Geocode.py]] for instructions on how to use the script.&lt;br /&gt;
&lt;br /&gt;
==Build dead/alive flags==&lt;br /&gt;
First find the deaddate for each company. Make sure that you have companybasecore, ipocore, mascore tables. Then calculate a deaddate. If there is no exit then the deaddate is datelastinv + 5 years. Take a look at the queries below.&lt;br /&gt;
 CREATE TABLE deaddate AS&lt;br /&gt;
 SELECT c.coname, c.statecode, c.datefirstinv, c.datelastinv, i.issuedate, m.announceddate&lt;br /&gt;
 FROM companybasecore AS c&lt;br /&gt;
 LEFT JOIN companybasekeyipokeycore AS ipokey ON c.coname = ipokey.coname AND c.statecode = ipokey.statecode AND c.datefirstinv = &lt;br /&gt;
 ipokey.datefirstinv  &lt;br /&gt;
 LEFT JOIN companybasekeymaskeycore AS maskey ON c.coname = maskey.coname AND c.statecode = maskey.statecode AND c.datefirstinv = &lt;br /&gt;
 maskey.datefirstinv&lt;br /&gt;
 LEFT JOIN ipocore AS i ON i.issuer = ipokey.ipoissuer AND i.issuedate = ipokey.ipoissuedate AND i.statecode = ipokey.ipostatecode&lt;br /&gt;
 LEFT JOIN mascore AS m ON m.targetname = maskey.mastargetname AND m.targetstatecode = maskey.masstatecode AND m.announceddate = &lt;br /&gt;
 maskey.announceddate; &lt;br /&gt;
 --44740&lt;br /&gt;
&lt;br /&gt;
 CREATE TABLE deaddate1 AS&lt;br /&gt;
 SELECT *,&lt;br /&gt;
 CASE &lt;br /&gt;
 WHEN issuedate IS NULL AND announceddate IS NULL THEN (datelastinv + INTERVAL '5 YEAR')::date&lt;br /&gt;
 WHEN issuedate IS NOT NULL THEN issuedate&lt;br /&gt;
 WHEN announceddate IS NOT NULL THEN announceddate &lt;br /&gt;
 END AS deaddate&lt;br /&gt;
 FROM deaddate;&lt;br /&gt;
 --44740&lt;br /&gt;
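The deaddate can then be turned into a dead/alive flag. The sketch below is an assumption rather than the method used for vcdb2: it treats a company as alive if its deaddate falls after a reference date, and both the reference date (2017-01-01) and the aliveflag column name are hypothetical.&lt;br /&gt;
 --aliveflag is 1 when the company's deaddate is after the chosen reference date&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, deaddate,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN deaddate &amp;gt; DATE '2017-01-01' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS aliveflag&lt;br /&gt;
 FROM deaddate1;&lt;br /&gt;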
&lt;br /&gt;
==Creating Stage Flags Table==&lt;br /&gt;
Stage flags will be used later on to determine if a company received seed, early or later stage financing. The growthflag is 1 if any of the seed, early or later flags is 1. The excludeflag marks rounds that financed activities we are not interested in, so the companies involved can be excluded from our dataset. Entries like 'Open Market Purchase', 'PIPE', etc. are the things that the excludeflag filters out. The stageflags table is built from the round table.&lt;br /&gt;
 DROP TABLE stageflags;&lt;br /&gt;
 CREATE TABLE stageflags AS&lt;br /&gt;
 SELECT coname, statecode, datefirstinv, rounddate, stage3,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
 END AS seedflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS earlyflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Later Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS laterflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS growthflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS transactionflag,&lt;br /&gt;
 CASE&lt;br /&gt;
  WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' &lt;br /&gt;
 OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1::int&lt;br /&gt;
  ELSE 0::int&lt;br /&gt;
  END AS excludeflag&lt;br /&gt;
 FROM round;&lt;/div&gt;</summary>
		<author><name>AdrianS</name></author>
		
	</entry>
</feed>