Changes

Jump to navigation Jump to search
11,683 bytes added ,  13:44, 21 September 2020
no edit summary
{{Project|Has project output=Data,Tool|Has sponsor=McNair ProjectsCenter
|Has title=Venture Capital (Data)
|Has owner=Adrian Smart,Jake Silberman|Has notes=|Is dependent on=|Depends upon it=|Has project status=Active, Meghana Gaur,
|Has keywords=Data
|Has project status=Subsume
}}
==NOTICE==
 
This project page is largely redundant and needs to be cleaned up. The main project pages for this project are:
[[Retrieving US VC Data From SDC]], and [[VC Database Rebuild]].
 
==Project Objective==
WHERE NOT EXISTS(SELECT * FROM ipoduplicatekeys WHERE (ipospkey.pkey = ipoduplicatekeys.pkey));
--9490
 
To verify that there are no more duplicates create another table.
DROP TABLE iposcountdupes1;
CREATE TABLE iposcountdupes1 AS
SELECT issuer, issuedate, statecode, COUNT(pkey)
FROM iponoduplicates
GROUP BY issuer, issuedate, statecode;
\COPY iposcountdupes1 TO 'iposcountdupes1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
You will notice that there are still 9 duplicate records that slipped thru somehow. Remove these manually.
DELETE FROM iponoduplicates WHERE pkey = 'PacTel Corp1993-12-02CA$299.00';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Templeton Dragon Fund Inc1994-09-21FL$169.50';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Sterling Commerce1996-03-08TX$240.00' AND proceedsamt = '288';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Sothebys Holdings Inc1988-05-13NY$27.00';
--1
DELETE FROM iponoduplicates WHERE pkey = 'TD Waterhouse Group Inc1999-06-23NY$655.20';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Berlitz International Inc1989-12-13NJ$34.65';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Spain Fund Inc1988-06-21NY$39.00';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Ultramar Corp1992-06-26CT$85.39';
--1
DELETE FROM iponoduplicates WHERE pkey = 'Goldman Sachs Group Inc1999-05-03NY$424.00';
--1
 
iponoduplicates is now the master ipo table that contains unique keys which are concatenated from issuer, issuedate and prinicpalamt. It should contain 9481 lines.
 
==Cleaning mas table==
There were duplicate records in this file because the original text file from SDC contained addresses that were on two rows for some companies. Therefore the easiest way to get rid of these records is to remove lines in textpad and reimport into the mas table.
Remove lines in textpad using regex: ^#.*\n and replacing with nothing.
After reimporting there will still be duplicate rows. Slap an id number on them and take the min as shown below.
CREATE TABLE mas1 AS
SELECT *,
concat(targetname::text, targetstate::text, announceddate::text) AS pkey
FROM mas;
ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY;
--114890
 
DROP TABLE masminid;
CREATE TABLE masminid AS
SELECT targetname, targetstate, announceddate, MIN(id)
FROM mas1
GROUP BY targetname, targetstate, announceddate;
--114825
 
DROP TABLE masnodupes;
CREATE TABLE masnodupes AS
SELECT mas1.*
FROM mas1 JOIN masminid ON mas1.id = masminid.min;
\COPY masnodupes TO 'masnodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
masnodupes should have no duplicates based on targetname, targetstate, announceddate. After running it thru the matcher you will get about 800 matching errors. We decided to exclude any rows that had equivalent targetname, targetstate where the announced dates fell within the same week. In this case we took the minimum date value and excluded the other row. Below are the queries to make those tables.
 
DROP TABLE masdistinctkeys;
CREATE TABLE masdistinctkeys AS
SELECT DISTINCT targetname, targetstate, announceddate
FROM masnodupes;
--114825
\COPY masdistinctkeys TO 'masdistinctkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
DROP TABLE masdistinctkeysmindates;
CREATE TABLE masdistinctkeysmindates AS
SELECT targetname, targetstate, MIN(announceddate) AS announceddate
FROM masdistinctkeys
GROUP BY targetname, targetstate;
--113236
 
DROP TABLE masdistinctkeysdatewindow;
CREATE TABLE masdistinctkeysdatewindow AS
SELECT masdistinctkeys.*, masdistinctkeysmindates.announceddate as minanndate,
CASE WHEN masdistinctkeys.announceddate - INTERVAL '7 day' > masdistinctkeysmindates.announceddate THEN 1::int
ELSE 0::int
END AS dateflag
FROM masdistinctkeys JOIN masdistinctkeysmindates ON (masdistinctkeys.targetname = masdistinctkeysmindates.targetname AND
masdistinctkeys.targetstate = masdistinctkeysmindates.targetstate);
--114825
\COPY masdistinctkeysdatewindow TO 'masdistinctkeysdatewindow.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
CREATE TABLE masdistinctkeys1 AS
SELECT targetname, targetstate, announceddate
FROM masdistinctkeysdatewindow
WHERE dateflag = 0;
--113267
\COPY masdistinctkeys1 TO 'masdistinctkeys1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
Now when you re-run the matcher on portcokeys and masdistinctkeys1 you will get 428 errors and excluding Undisclosed fields gets you down to 356.
 
In Excel create a column to exclude any entries where the announced date occurs before the firstinvdate. Afterwards, manually flag the other duplicate mismatches picking the earlier announced date.
 
==Import Flagged Matcher Data Into PSQL==
After everything in the matcher output files is manually flagged create two new tables in Excel for iposmatcherouput and masmatcheroutput and copy those files into the tables.
CREATE TABLE masmatcheroutputwithexcludeflag(
file1coname varchar(255),
file2targetname varchar(255),
warningmsg varchar(255),
file1coname1 varchar(255),
file1statecode varchar(2),
file1datefirstinv date, --mm-dd-yyyy
file2targetname2 varchar(255),
file2statecode varchar(2),
file2announceddate date,
exclude boolean,
excludemaster boolean,
excludemanual boolean,
excludemanualmaster boolean,
excludefinal boolean
);
\COPY masmatcheroutputwithexcludeflag FROM 'masmatcheroutput-nohead.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--9400
 
CREATE TABLE ipomatcheroutputwithexcludeflag(
file1coname varchar(255),
file2issuer varchar(255),
warningmsg varchar(255),
file1coname1 varchar(255),
file1statecode varchar(2),
file1datefirstinv date, --mm-dd-yyyy
file2issuer1 varchar(255),
file2issuedate date,
file2statecode varchar(2),
excludedate boolean,
excludemaster boolean,
excludemanual boolean,
excludemanualmaster boolean,
excludefinal boolean
);
\COPY ipomatcheroutputwithexcludeflag FROM 'iposmatcheroutput-nohead.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--2591
 
==Joining portco with ipo and mas==
In order to match the company with the ipo or acquisition date we need to join the three tables together. Do this in a two step process. First join with ipo and then take the resulting table and join with mas. This will give you companies with an ipo column and acquisition column. There will be a lot of blanks because some companies didn't have ipos or acquisitions.
DROP TABLE portcoipojoin;
CREATE TABLE portcoipojoin AS
SELECT coname, statecode, datefirstinv, matcher.file2issuer, matcher.file2issuedate
FROM companybase LEFT JOIN ipomatcheroutputwithexcludeflag AS matcher ON companybase.coname = matcher.file1coname AND
companybase.statecode = matcher.file1statecode AND companybase.datefirstinv = matcher.file1datefirstinv AND
matcher.excludefinal = FALSE;
--44773
\COPY portcoipojoin TO 'portcoipojoin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
Now join with the iposnoduplicates table to grab proceedamt and principalamt, etc.
DROP TABLE portcoipojoinamt;
CREATE TABLE portcoipojoinamt AS
SELECT portcoipojoin.*, ipo.principalamt, ipo.proceedsamt, ipo.naiccode, ipo.zipcode, ipo.status, ipo.foundeddate, ipo.pkey
FROM portcoipojoin LEFT JOIN iponoduplicates AS ipo ON portcoipojoin.file2issuer = ipo.issuer AND
portcoipojoin.file2issuedate = ipo.issuedate AND portcoipojoin.statecode = ipo.statecode;
 
DROP TABLE portcoipojoinmajoin;
CREATE TABLE portcoipojoinmajoin AS
SELECT portco.*, matcher.file2targetname, matcher.file2announceddate
FROM portcoipojoin AS portco LEFT JOIN masmatcheroutputwithexcludeflag AS matcher ON portco.coname = matcher.file1coname
AND portco.statecode = matcher.file1statecode AND portco.datefirstinv = matcher.file1datefirstinv AND matcher.excludefinal
= FALSE;
--44774
\COPY portcoipojoinmajoin TO 'portcoipojoinmajoin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
For mas you will need to join to the masnodupes table because you will want acquiror name and transaction amount. The matcher file only contains the key without the extra information.
DROP TABLE portcomajoinmasnodupes;
CREATE TABLE portcomajoinmasnodupes AS
SELECT portcomajoin.*, masnodupes.*
FROM portcomajoin LEFT JOIN masnodupes ON masnodupes.targetname = portcomajoin.file2targetname AND
masnodupes.targetstatecode = portcomajoin.statecode AND masnodupes.announceddate = portcomajoin.file2announceddate;
--44755
 
Then join with the portco table again and extract out the city and address.
DROP TABLE portcoipomalocation;
CREATE TABLE portcoipomalocation AS
SELECT portcojoin.*, city, addr1, addr2, indclass, indsubgroup3, indminor
FROM portcoipojoinmajoin AS portcojoin LEFT JOIN companybase ON portcojoin.coname = companybase.coname AND
portcojoin.statecode = companybase.statecode;
--44878
\COPY portcoipomalocation TO 'portcoipomalocation.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
--now join the two ma and ipo tables together on coname, statecode, datefirstinv
DROP TABLE portcoipoma;
CREATE TABLE portcoipoma AS
SELECT ma.*, ipo.file2issuer, ipo.file2issuedate, ipo.principalamt, ipo.proceedsamt, ipo.naiccode, ipo.zipcode, ipo.status, ipo.foundeddate
FROM portcomajoinmasnodupes AS ma JOIN portcoipojoinamt AS ipo ON ma.coname = ipo.coname AND ma.statecode = ipo.statecode AND ma.datefirstinv = ipo.datefirstinv;
--44755
\COPY portcoipoma TO 'portcoipoma.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
DROP TABLE portcoipomanodupes;
CREATE TABLE portcoipomanodupes AS
SELECT *
FROM portcoipoma WHERE transactionamt IS NULL OR proceedsamt IS NULL;
--44706
\COPY portcoipomanodupes TO 'portcoipomanodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
==Joining geo with portcoipomas==
Check the portcoipomas table first. Undisclosed Companies may have slipped through so you might have to rerun the joins against the companybase1 table selecting the appropriate flags. You will match portcoipomas with the geo data using coname, city, year so you will need to strip the year from the datefirstinv. Use the EXTRACT(YEAR FROM datefirstinv) command. I broke these into separate tables:
CREATE TABLE portcoipomalocationexclude AS
SELECT *, EXTRACT(YEAR FROM datefirstinv) AS datefirstyear
FROM portcoipomalocation AS portco
WHERE NOT EXISTS(SELECT * FROM excludeipomadupes as exclude WHERE (portco.coname = exclude.coname AND portco.statecode =
exclude.statecode AND portco.datefirstinv = exclude.datefirstinv AND exclude.excludeflag = 1));
 
CREATE TABLE companybasegeo AS
SELECT portco.*, geo.lattitude, geo.longitude, geo.noaddress
FROM portcoipomalocationexclude as portco LEFT JOIN geo ON (portco.coname = geo.coname AND portco.city = geo.city AND
portco.datefirstyear = geo.startyear);
 
You will notice that the output has blank entries for the noaddress flag. This indicates that the tables aren't joining correctly. Indeed you will see that the company Capella Systems, Inc. in the geo table is named Capella Systems Inc. in the portcoipomalocationexclude table. Therefore you will need to use the Matcher tool to match the keys in the two tables.
 
CREATE TABLE geomatcher (
geoconame varchar(255),
geocity varchar(100),
geostartyear varchar(4),
file2coname varchar(255),
file2city varchar(100),
file2datefirstyear varchar(4)
);
\COPY geomatcher FROM 'matcheroutputgeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--43770
 
Geolookupbatch1 contains the first 2299 records. Geolookupbatch2 contains 6597 records. Geolookupbatch3 contains 2527 records. This sums to 11423 which matches the number of initial lookups?

Navigation menu