Changes

Jump to navigation Jump to search
no edit summary
{{Project|Has project output=Data,How-to|Has sponsor=McNair ProjectsCenter
|Has title=VentureXpert Data
|Has owner=Augi Liebster,
}}
The successors to this project include:*[[Augi Liebster (Work Log)VCDB24]], which is the most recent iteration.*[[VCDB23]]*[[VCDB20Q3]]*[[VCDB20H1]]*[[VCDB4]]
==Relevant Former Projects==
==Location==
My scripts for SDC pulls are located in the E drive in the location:
E:\McNair\Projects\VentureXpertDatabase\ScriptsForSDCExtract
My successfully pulled and normalized files are stored in the location:
E:\McNair\Projects\VentureXpertDatabase\ExtractedDataQ2
My script scripts for loading tables and data is are in one big text file in the location: E:\McNair\Projects\VentureXpertDatabase\vcdb3\LoadingScripts
There are a variety of SQL files in there with self explanatory names. The file that has all of the loading scripts is called LoadingScriptsV1. The folder vcdb2 is there for reference to see what people before had done. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.
datefirstinv date,
rounddate date,
disclosedamt moneyreal, investor fundname varchar(255)
);
==Cleaning Companybase, Fundbase, Firmbase, and BranchOffice==
===Cleaning Company===
First clean companybase. Here the The primary key for port cos will be coname, datefirstinv, and statecode. Before checking whether this is a valid primary key, remove the undisclosed companies. I will explain the second part of the query concerning New York Digital Health later.
DROP TABLE companybasecore;
SELECT * FROM (SELECT coname, datefirstinv, statecode FROM companybase) as key GROUP BY coname, datefirstinv, statecode HAVING COUNT(key) > 1;
The company named 'New York Digital Health LLC' came up as the company that is causing the problems. I queried to find the two rows that list this company name in companybase and chose to keep the row that had a lower updateddatethe earlier updated date. It is a good practice to avoid deleting rows from tables when possible, so I added the filter as a WHERE clause to exclude one of the New York Digital listings. ===Cleaning Fundbase===The primary key for funds will be only the fundname. First get rid of all of the undisclosed funds.   DROP TABLE fundbasenound; CREATE TABLE fundbasenound AS SELECT DISTINCT * FROM fundbase WHERE fundname NOT LIKE '%Undisclosed Fund%'; --28886  SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbasenound)a; --28833 As you can see, fundbase still has rows that share fundnames. If you are wondering why the DISTINCT in the first query did not eliminate these, it is because this DISTINCT applies to the whole row not individual fundnames. Thus, only completely duplicate rows will be eliminated in the first query. I chose to keep the funds that have the earlier last investment date.   DROP TABLE fundups; CREATE TABLE fundups AS SELECT fundname, max(lastinvdate) AS lastinvdate FROM fundbasenound GROUP BY fundname HAVING COUNT(*)>1; --53  DROP TABLE fundbasecore; CREATE TABLE fundbasecore AS SELECT A.* FROM fundbasenound AS A LEFT JOIN fundups AS B ON A.fundname=B.fundname AND A.lastinvdate=B.lastinvdate WHERE B.fundname IS NULL AND B.lastinvdate IS NULL; --28833 Since the count of fundbasecore is the same as the number of distinct fund names, we know that the fundbasecore table is clean. In the first query I am finding duplicate rows and choosing the row that has the greater last investment date. I then match this table back to fundbasenound but choose all the rows from fundbasecore for which there is no corresponding fund in fundups based on fund name and date of last investment. This allows the funds with the earlier date of last investment to be chosen. ===Cleaning Firmbase===The primary key for firms will be firm name. First I got rid of all undisclosed firms. I also filtered out two firms that have identical firm names and founding dates. The reason for this is because I use founding dates to filter out duplicate firm names. If there are two rows that have the same firm name and founding date, they will not be filtered out by the third query below. Thus, I chose to filter those out completely.  DROP TABLE firmbasenound; CREATE TABLE firmbasenound AS SELECT DISTINCT * FROM firmbase WHERE firmname NOT LIKE '%Undisclosed Firm%' AND firmname NOT LIKE '%Amundi%' AND firmname NOT LIKE '%Schroder Adveq Management%'; --15452  SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM firmbasenound)a; --15437 Since these counts are not equal we will have to clean the table further. We will use the same method from before.  DROP TABLE firmdups; CREATE TABLE firmdups AS SELECT firmname, max(foundingdate) as foundingdate FROM firmbasenound GROUP BY firmname HAVING COUNT(*)>1; --15  DROP TABLE firmbasecore; CREATE TABLE firmbasecore AS SELECT A.* FROM firmbasenound AS A LEFT JOIN firmdups AS B ON A.firmname=B.firmname AND A.foundingdate=B.foundingdate WHERE B.firmname IS NULL AND B.foundingdate IS NULL; --15437 Since the count of firmbasecore and the DISTINCT query are the same, the firm table is now clean. ===Cleaning Branch Offices===When cleaning the branch offices, I had to remove all duplicates in the table. This is because the table is so sparse that often the only data in a row would be the fund name the branch was associated with. Thus, I couldn't filter based on dates as I had been doing previously for firms and funds. The primary key is firm name.  DROP TABLE bonound; CREATE TABLE bonound AS SELECT *, CASE WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int ELSE 0::int END AS undisclosedflag FROM branchoffices; --10353  SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM bonound)a; --10042 Since these counts aren't the same, we will have to work a little more to clean the table. As stated above, I did this by excluding the firm names that were duplicated.  DROP TABLE branchofficecore; CREATE TABLE branchofficecore AS SELECT A.* FROM bonound AS A JOIN ( SELECT bonound.firmname, COUNT(*) FROM bonound GROUP BY firmname HAVING COUNT(*) =1 ) AS B ON A.firmname=B.firmname WHERE undisclosedflag=0; --10032  SELECT COUNT(*) FROM (SELECT DISTINCT firmname FROM branchofficecore)a; --10032 Since these counts are the same, we are good to go. The count is 10 lower because we completely removed 10 firmnames from the listing by throwing out the duplicates.
==Instructions on Matching PortCos to Issuers and M&As From Ed==
===Company Standardizing===
Get portco keys
Now prepare to repeat that process ===MA Cleaning and Matching===First remove all of the duplicates in the MA data. Do this by running aggregate queries on every column except for M&A's and IPOsthe primary key:*For M&As your keys DROP TABLE MANoDups; CREATE TABLE MANoDups AS SELECT targetname, targetstate, announceddate, min(effectivedate) AS effectivedate, MIN(acquirorname) as acquirorname, MIN(acquirorstate) as acquirorstate, MAX(transactionamt) as transactionamt, MAX(enterpriseval) as enterpriseval, MIN(for nowacquirorstatus) will be as acquirorstatus FROM mas GROUP BY targetname, statecodetargetstate, announceddate ORDER BY targetname, dateannounced targetstate, announceddate; --119374  SELECT COUNT(*For IPOs your keys ) FROM(for nowSELECT DISTINCT targetname, targetstate, announceddate FROM manodups) will be issuernamea; --119374 Since these counts are equivalent, the data set is clean. Then get all the primary keys from the table and copy the distinct target names into a text file.  DROP TABLE makey; CREATE TABLE makey AS SELECT targetname, statecodetargetstate, issuedateannounceddate*FIRST CLEAN EACH DATASET FROM manodups; --119374  \COPY (SELECT DISTINCT targetname FROM makey) TO 'DistinctTargetName. The easiest way to remove duplicates (if you have lots txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV; --117212 After running this list of them) is to use an aggregate query:distinct target names through the matcher, put the standardized MA list into the data base.
DROP TABLE IPOCoreNoDupsMaStd; CREATE TABLE IPOCoreNoDups asMaStd ( targetnamestd varchar(255), targetname varchar(255), norm varchar(100), x1 varchar(255), x2 varchar(255) );  \COPY mastd FROM 'DistinctTargetName.txt-DistinctTargetName.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --117212 Then match the list of standardized names back to the makey table to get a table with standardized keys and primary keys. This will be your input for matching against port cos.   DROP TABLE makeysstd; CREATE TABLE makeysstd AS SELECT B.targetnamestd, A.* FROM makey AS A JOIN mastd AS B ON A.targetname=B.targetname; --119374  \COPY makeysstd TO 'MAMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --119374 Use this text file to match against the PortCoMatchInput. Your job will be to determine whether the matches between the MAs and PortCos are true matches. The techniques that I used are described in the section below. ===IPO Cleaning and Matching===The process is the same for IPOs.  DROP TABLE iponodups; CREATE TABLE iponodups AS SELECT issuernameissuer, statecode, issuedate, maxMAX(var1principalamt) AS principalamt, MAX(proceedsamt) AS proceedsamt, MIN(naiccode) as var1naicode, MIN(zipcode) AS zipcode, avgMIN(var2status) as var2AS status, ... MIN(foundeddate) AS foundeddate FROM IPOCore ipos GROUP BY issuernameissuer, statecode, issuedate ORDER BY issuernameissuer, statecode, issuedate; --11149  SELECT COUNT(*) FROM(SELECT DISTINCT issuer, statecode, issuedate FROM iponodups)a; --11149  DROP TABLE ipokeys; CREATE TABLE ipokeys AS SELECT issuer, statecode, issuedate FROM iponodups; --11149  \COPY (SELECT DISTINCT issuer FROM ipokeys) TO 'IPODistinctIssuer.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --10803  DROP TABLE ipokeysstd; CREATE TABLE ipokeysstd ( issuerstd varchar(255), issuer varchar(255), norm varchar(100), x1 varchar(255), x2 varchar(255) );
Note that you need all vars to be inside aggregates and that you should choose the aggregate function sensibly by looking at the data\COPY ipokeysstd FROM 'IPODistinctIssuer. Generally use MAX for amounts and MIN for datestxt-IPODistinctIssuer. You can also use MAX or MIN on text stringstxt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --10803
And now build the same stacks as before but to create Issuerkeystd and TargetKeystd (or whatever you call them) DROP TABLE ipostd; CREATE TABLE ipostd AS SELECT B. Make sure that issuerstd (and targetnamestd) is in the first column, A.* FROM ipokeys AS A JOIN ipokeysstd AS B ON A.issuer=B.issuer; --11149
Now match Portcokeystd to Issuerkeystd, and match Portcokeystd to Targetkeystd \COPY ipostd TO 'IPOMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV*Move the files into the input director as before*Run the matcher script but WITHOUT mode 2: --11149
perl Matcher.pl -file1="PortCoMatchInput.txt" -file2="IssuerMatchInput.txt" perl Matcher.pl -file1="As with MA, match this file against the PortCoMatchInputfile without mode 2.txt" -file2="TargetMatchInput.txt" Open each of these files in excel and mark good matches with 1s and bad Then manually check the matches with 0s by adding columns to compare dates, states, etc, and filteringusing the techniques described below.
When you are done: *Build a new sheet of just good matchesI generally use MAX for amounts and MIN for dates.*Save the excel files*Copy each of your match sheets I also chose to a use MIN on text file*CREATE TABLE to reflect the data you are going to load (include std names and keys)*\COPY the data (using the exact copy command above but changing the table and file names) into the table*Celebrate!*Next we'll deal with any firms that have an IPO and an M&A and decide which we'll keep*And then we'll join in the chosen IPO and M&A data and move on!strings.
==Cleaning IPO and MA Data==
==Process For Creating the PortCoExits Table==
Even if you ===MA Process===First we must load the clean, manually checked tables back into the excel sheet for Warning Multiple Matches with the Hall warningdatabase. DROP TABLE MAClean; CREATE TABLE MAClean ( conamestd varchar(255), targetnamestd varchar(255), method varchar(100), x1 varchar(255), coname varchar(255), statecode varchar(10), datefirstinv date, x2 varchar(255), targetname varchar(255), targetstate varchar(10), announceddate date );  \COPY MAClean FROM 'MAClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --7205  SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM MAClean)a; --7188 As you can see there are still may be duplicatesduplicate primary keys in the table. To get rid of these specify that I wrote a query where you join a table of only singular that chooses primary keys to the original tablethat occur only once and matches them against MANoDups. That way you will have unique primary keys by construction.
There are two companies that have the name Masspower in the MAClean file. One is written in all caps and will thus not be caught by an aggregate function. I will select only the companies where the primary keys occurs once and join this to MAClean. I will then select needed info from MANoDps.
DROP TABLE MACleanNoDups;
CREATE TABLE MACleanNoDups AS
) AS B
ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
LEFT JOIN MANoDups AS C ON A.targetnametargetnamestd=C.targetname AND A.targetstate=C.targetstate AND A.announceddate=C.announceddate;--7171
SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM MACleanNoDups)a;
Now do the same for the IPOs.
===IPO Process===
DROP TABLE IPOClean;
CREATE TABLE IPOClean (
conamestd varchar(255),
issuernamestd varchar(255),
method varchar(100),
x1 varchar(255),
coname varchar(255),
statecode varchar(10),
datefirstinv date,
x2 varchar(255),
issuername varchar(255),
issuerstate varchar(10),
issuedate date
);
\COPY IPOClean FROM 'IPOClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--2146
 
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM IPOClean)a;
--2141
 
As with the MA process, there were duplicates in the clean IPO table. Get rid of these using the same process as with MAs. Only choose the primary keys that occur once and join these to the IPONoDups table.
DROP TABLE IPOCleanNoDups;
CREATE TABLE IPOMAForReview
SELECT A.*, B.targetname, B.targetstate, B.announcedate
FROM IPOClean IPOCleanNoDups AS A JOIN MAClean MACleanNoDups AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
--92
--7079
Since 21412136-92=2049 2044 and 71887171-92=70967079, we know that the duplicate companies were extracted successfully. I then created a column that identifies whether a company underwent an MA or an IPO. A 0 indicates an MA and a 1 indicates an IPO. DROP TABLE ExitKeysCleanMA; CREATE TABLE ExitKeysCleanMA AS SELECT *, CASE WHEN issuername IS NULL AND issuerstate IS NULL AND issuedate IS NULL THEN 0 ELSE NULL END AS MAvsIPO FROM exitkeyscleanmanoexit; --7096
DROP TABLE ExitKeysCleanIPO; CREATE TABLE ExitKeysCleanIPO AS SELECT *, CASE WHEN targetname IS NULL AND targetstate IS NULL AND announceddate IS NULL THEN 1 ELSE NULL END AS MAvsIPO FROM exitkeyscleaniponoexit as A; --2049 I then wrote a query to check which whether the IPO issue date or announced date of the MA was lower earlier and used that to indicate whether I chose the company to have undergone an MA or an IPO in the column MSvsIPO(I chose based on which process came first). A 0 in the column represented an MA being chosen and a 1 represented an IPO being chosen. DROP TABLE IPOMASelected; CREATE TABLE IPOMASelected AS SELECT *, CASE WHEN issuedate < announceddate THEN 1 ELSE 0 END AS MAvsIPO FROM IPOMAForReview; --92
DROP TABLE MASelected;
CREATE TABLE MASelected AS
SELECT A.coname, A.statecode, A.datefirstinv, B. targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo 0::int as IPOvsMA FROM IPOMASelected AS B LEFT JOIN companybasecore AS A ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv IPOMAForReview WHERE mavsipoissuedate >=0announceddate;
--25
DROP TABLE IPOSelected;
CREATE TABLE IPOSelected AS
SELECT A.coname, A.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B. issuername, B.issuerstate, B.issuedate, B.mavsipo 1::int as IPOvsMA FROM IPOMASelected AS B LEFT JOIN companybasecore AS A ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv IPOMAForReview WHERE mavsipo=1issuedate < announceddate;
--67
I then put together all of made the ExitKeysClean table using the portco primary key and the indicator MAvsIPO indicator column.  DROP TABLE ExitKeys; CREATE TABLE ExitKeys AS SELECT coname, statecode, datefirstinv, ipovsma FROM IPONoConflict UNION SELECT coname, statecode, datefirstinv, ipovsma FROM IPOSelected UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MANoConflict UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MASelected; --9215 ==Create the PortCoExit And PortCoAliveDead Tables==From consulting with Ed and the IPOs that VC Database Rebuild wiki, I selected into one decided to make the PortCoExit table with an mavsipo, an exitdate, an exited, and all of an exitvalue column. I use the MAs MAvsIPO column to add in data. It is very important that you have constructed this column. DROP TABLE PortCoExit; CREATE TABLE PortCoExit AS SELECT A.coname, A.statecode, A.datefirstinv, A.datelastinv, A.city, B.ipovsma, CASE WHEN B.ipovsma IS NOT NULL THEN 1::int ELSE 0::int END AS Exit, CASE WHEN B.ipovsma=1 THEN C.proceedsamt::numeric WHEN ipovsma=0 THEN D.transactionamt::numeric ELSE NULL::numeric END AS exitvaluem, CASE WHEN B.ipovsma=1 THEN C.issuedate WHEN ipovsma=0 THEN D.announceddate ELSE NULL::date END AS exitdate, CASE WHEN B.ipovsma=1 THEN extract(year from C.issuedate) WHEN ipovsma=0 THEN extract(year from D.announceddate) ELSE NULL::int END AS exityear FROM companybasecore AS A LEFT JOIN ExitKeys AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv LEFT JOIN IPOCleanNoDups AS C ON A.coname=C.coname AND A.statecode=C.statecode AND A.datefirstinv=C.datefirstinv LEFT JOIN MACleanNoDups AS D ON A.coname=D.coname AND A.statecode=D.statecode AND A.datefirstinv=D.datefirstinv; --48001 I selected into another then used this tableto build one that has information as to whether a company was dead or alive. I did found this information by checking whether a company had undergone an IPO or MA, indicating the company was dead. Alternatively, if the company's date of last investment was more than 5 years ago, I marked the company as dead.  DROP TABLE PortCoAliveDead; CREATE TABLE PortCoAliveDead AS SELECT *, datefirstinv as alivedate, extract(year from datefirstinv) as aliveyear, CASE WHEN exitdate IS NOT NULL then exitdate WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN (datelastinv + INTERVAL '5 year') ELSE NULL::date END AS deaddate, CASE WHEN exitdate IS NOT NULL then exityear WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN extract(year from (datelastinv + INTERVAL '5 year')) ELSE NULL::int END AS deadyear FROM PortCoExit; --48001 ==GeoCoding Companies, Firms, and Branch Offices==A helpful page here is the [[Geocode.py]] page which explains how to use the Geocoding script. You will have to tweak the Geocode script when geocoding as each of these tables has a different primary key. It is vital that you include the primary keys in the file you input and output from the Geocoding script. Without these, you will not be able to join the latitudes and longitudes back to the firm, branch office, or company base tables. Geocoding costs money since we are using UNION statementsthe Google Maps API. The process doesn't cost much, but in order to save money I did this because tried to salvage as much of the preexisting geocode information I could find.===Companies===I didnfound the table of old companies with latitudes and longitudes in vcdb2 and loaded these into vcdb3.  DROP TABLE oldgeocords; CREATE TABLE oldgeocords ( coname varchar(255), statecode varchar(10), datefirstinv date, ivestedk real, city varchar(255), addr1 varchar(255), addr2 varchar(100), latitude numeric, longitude numeric );  \COPY oldgeocords FROM 'companybasegeomaster.txt' WITH DELIMITER AS E'\t want ' HEADER NULL AS '' CSV --44740 The API occasionally will give erroneous latitude and longitude readings. In order to duplicate any IPOs catch only the good ones, I found the latitude and longitude lines that encompass the mainland US and created an exclude flag to make sure companies were in this box. I then created flags to include companies in Puerto Rico, Hawaii, and UNION acts Alaska. Companies that were in these places often had wrong latitude and longitude readings of 44.93, 7.54, so I ran a query making sure that these weren't listed.   DROP TABLE geoallcoords; CREATE TABLE geoallcoords AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM oldgeocords; --44740  DROP TABLE geoallcoords1; CREATE TABLE geoallcoords1 AS SELECT *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as a akflag FROM geoallcoords; --44740 I then included only companies that were either in the mainland US, Hawaii, Alaska, or Puerto Rico.   DROP TABLE goodgeoold; CREATE TABLE goodgeoold AS SELECT DISTINCT statement A.*, B. Thus if latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords1 AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --38498 I then found the number of remaining companies that needed to be geocoded. Only companies that have addresses listed are able to be accurately geocoded. If we attempt to geocode based on city, the two tables added together equaled location returned will simply be the final count center of the tablecity. Thus, I chose the companies that we did not already have any duplicate rowslistings for and had a valid address.  DROP TABLE SelectedIPOSAllremaininggeo; CREATE TABLE SelectedIPOSALL remaininggeo AS SELECT A.coname, A.statecode, A.datefirstinv, A.targetnameaddr1, A.targetstateaddr2, A.announceddatecity, A.issuernamezip FROM companybasecore AS A LEFT JOIN goodgeoold AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE B.coname IS NULL AND A.addr1 IS NOT NULL; --5955  \COPY remaininggeo TO 'RemainingGeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --5955 I copied this table into excel to concatenate the address, city, state, and zipcode columns into one column. This can and should be done in SQL, but I was not aware this could be done. I then ran remaininggeo through the Geocode script with columns coname, statecode, datefirstinv, and address in the inputted file.  DROP TABLE remaining; CREATE TABLE remaining ( coname varchar(255), statecode varchar(10), datefirstinv date, latitude numeric, longitude numeric );  \COPY remaining FROM 'RemainingLatLong.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --5955 I then ran the same geographical checks on the newly geocoded companies and found all of the good geocodes.   DROP TABLE geoallcoords2; CREATE TABLE geoallcoords2 AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM remaining; --5955  DROP TABLE geoallcoords3; CREATE TABLE geoallcoords3 AS SELECT *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int 0::int END as prflag, CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM geoallcoords2; --5955  DROP TABLE goodgeonew; CREATE TABLE goodgeonew AS SELECT A.issuerstate*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords3 AS B ON A.issuedate, coname=B.coname AND A.statecode=B.statecode AND A.mavsipo FROMdatefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --5913 I then combined the old and new geocodes and matched them back to the company base table to get a geo table for companies.  DROP TABLE geocodesportco; ExitKeysCleanIPO CREATE TABLE geocodesportco AS SELECT A.* from goodgeonew
UNION
SELECT IPOSelectedB.* from goodgeoold; --44411  DROP TABLE portcogeo; CREATE TABLE portcogeo AS SELECT A.*, B.latitude, B.longitude FROMcompanybasecore AS A LEFT JOIN Geocodesportco AS B ON A.coname=B.coname AND A.datefirstinv=B.datefirstinv AND A.statecode=B.statecode; IPOSelected--48001 ===Firms===This process is largely the same as for companies. I found old firms that had already been geocoded and checked for accuracy.  DROP TABLE oldfirmcoords; CREATE TABLE oldfirmcoords ( firmname varchar(255), latitude numeric, longitude numeric ); \COPY oldfirmcoords FROM 'FirmCoords.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV --5556  DROP TABLE firmoldfilter; CREATE TABLE firmoldfilter AS SELECT *, CASE WHEN longitude < -125 OR longitude > -211666 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM oldfirmcoords; --Makes sense 5556 Since oldfirmcoords does not have state codes, we have to find a way to include state codes to add in companies based in Puerto Rico, Hawaii, and Alaska. I did this by matching the firmoldfilter table back to the firm base table. I used the coalesce function because 2049+67we wanted to exclude companies that we had not geocoded due to faulty addresses.   DROP TABLE firmcoordsmatch1; CREATE TABLE firmcoordsmatch1 AS SELECT A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmbasecore AS A LEFT JOIN firmoldfilter AS B ON A.firmname=2116B.firmname; --15437 Then the process of tagging the PR, HI, and AK companies and including only correctly tagged companies is the same as for companies.   DROP TABLE firmcoordsexternal; CREATE TABLE firmcoordsexternal AS SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM firmcoordsmatch1; --15437  DROP TABLE goodfirmgeoold; CREATE TABLE goodfirmgeoold AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternal AS B ON A.firmname=B.firmname WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --5346
Find the remaining firms and run the geocode script on these firms  DROP TABLE SelectedMASAllremainingfirm; CREATE TABLE SelectedMASALL remainingfirm AS SELECT A.conamefirmname, A.statecodeaddr1, A.datefirstinvaddr2, A.targetnamecity, A.targetstatestate, A.announceddatezip FROM firmcoreonedupremoved AS A LEFT JOIN goodfirmgeoold AS B ON A.firmname=B.firmname WHERE B.firmname IS NULL AND A.addr1 IS NOT NULL AND A.msacode!='9999'; --706  \COPY remainingfirm TO 'FirmGeoRemaining.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV --706  DROP TABLE firmremainingcoords; CREATE TABLE firmremainingcoords( firmname varchar(255), latitude numeric, longitude numeric );  \COPY firmremainingcoords FROM 'FirmRemainingCoords.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV --706 Follow the same filtering process as above to get the good geocodes.   DROP TABLE firmnewfilter; CREATE TABLE firmnewfilter AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM firmremainingcoords; --706  DROP TABLE firmcoordsmatch2; CREATE TABLE firmcoordsmatch2 AS SELECT A.issuernamefirmname, A.issuerstatestate, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmcoreonedupremoved AS A LEFT JOIN firmnewfilter AS B ON A.issuedatefirmname=B.firmname; --15437  DROP TABLE firmcoordsexternalremaining; CREATE TABLE firmcoordsexternalremaining AS SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag FROM firmcoordsmatch2; --15437  DROP TABLE goodfirmgeonew; CREATE TABLE goodfirmgeonew AS SELECT A.mavsipo *, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROMfirmcoreonedupremoved AS A LEFT JOIN firmcoordsexternalremaining AS B ExitKeysCleanMA ON A.firmname=B.firmname WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --703 Combine the old and new geocoded firms and match them to firm base to get a firm geo table.  DROP TABLE firmgeocoords; CREATE TABLE firmgeocoords AS A SELECT * FROM goodfirmgeonew
UNION
SELECT MASelected* FROM goodfirmgeoold; --6049  DROP TABLE firmgeocore; CREATE TABLE firmgeocore AS SELECT A.* , B.latitude, B.longitude FROM firmbasecore AS A LEFT JOIN firmgeocoords AS B ON A.firmname=B.firmname; --15437 ===Branch Offices===I did not use old branch office data because I could not find it anywhere in the old data set. I have since found old data in the table firmbasecoords in vcdb2.  First copy all of the needed data out of the database to do geocoding.  \COPY (SELECT A.firmname, A.boaddr1, A.boaddr2, A.bocity, A.bostate, A.bozip FROMbonound AS A WHERE A.boaddr1 IS NOT NULL) TO 'BranchOffices.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --2046 Then load the data into the database and follow the same filtering process as above.  MASelectedDROP TABLE bogeo; CREATE TABLE bogeo ( firmname varchar(255), latitude numeric, longitude numeric );  \COPY bogeo FROM 'BranchOfficesGeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --71212046  DROP TABLE bogeo1; CREATE TABLE bogeo1 AS SELECT *, CASE WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 0::int END AS excludeflag FROM bogeo; --2046  DROP TABLE bomatchgeo; CREATE TABLE bomatchgeo AS SELECT A.*, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM branchofficecore AS A LEFT JOIN bogeo1 AS B ON A.firmname=B.firmname; --Makes sense because 25+709610032  DROP TABLE bogeo2; CREATE TABLE bogeo2 AS SELECT *, CASE WHEN bostate='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag, CASE WHEN bostate='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag, CASE WHEN bostate='AK' AND latitude!=44.9331 AND longitude!=71217.54012 THEN 1::int ELSE 0::int END as akflag FROM bomatchgeo; --10032 Match the correctly geocoded branch offices back to firm base to get the final table.
I then checked both of these files to make sure that their primary keys were still distinct and thus valid. DROP TABLE bogeocore1; CREATE TABLE bogeocore1 AS SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM SelectedMasAll)abogeo2 WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1; --71211161
DROP TABLE firmbogeo; CREATE TABLE firmbogeo AS SELECT COUNT(A.*) FROM(SELECT targetname, targetstateB.latitude AS BOLatitude, announceddate B.longitude AS BOLongitude FROM SelectedMasAll)afirmgeocore AS A LEFT JOIN bogeocore1 AS B ON A.firmname=B.firmname; --712115437
==Creating People Tables==We pulled data on executives in both portcos and funds. I describe the process below. If any of the explanations don't make sense, I also describe most tables in the section called Marcos's Code.===Company People=== SELECT COUNTDROP TABLE titlelookup; CREATE TABLE titlelookup( fulltitle varchar(*150) FROM(SELECT issuername, issuerstate charman int, ceo int, cfo int, coo int, cio int, cto int, otherclvl int, boardmember int, president int, vp int, issuedate FROM SelectedIPOMA founder int, director int )a; --2116
SELECT COUNT(*) \COPY titlelookup FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;'Important Titles in Women2017 dataset.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --2116628
I combined the two tables This table lists various titles one can have and checked the result to make sure it Unioned correctlyidentifies where they fall under traditional executive titles.
DROP TABLE SelectedIPOMAcopeople; CREATE TABLE SelectedIPOMA copeople( datefirstinv date, cname varchar(150), statecode varchar(2), prefix varchar(5), firstname varchar(50), lastname varchar(50), jobtitle varchar(150), nonmanaging varchar(1), prevpos varchar(255) );  \COPY copeople FROM 'Executives-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --194359 This table gets various executives from portcos. This is loaded from SDC. Next we have to identify which traditional executive level job the listed job title corresponds to. It also identifies whether a prefix identifies an executive as male or female. I made a stupid mistake of writing cname instead of coname when loading in the data. If you want to save yourself work, write coname.  DROP TABLE copeoplebase; CREATE TABLE copeoplebase AS SELECT copeople.*, CASE WHEN prefix='Ms' THEN 1::int WHEN prefix='Mr' THEN 0::int ELSE Null::int END AS titlefemale, CASE WHEN prefix='Ms' THEN 0::int WHEN prefix='Mr' THEN 1::int ELSE Null::int END AS titlemale, CASE WHEN prefix='Dr' THEN 1::int ELSE 0::int END AS doctor, CASE WHEN prefix IS NULL THEN 0::int ELSE 1::int END AS hastitle, CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int ELSE 1::int END AS hasperson, CASE WHEN fulltitle IS NOT NULL THEN 1::int ELSE 0::int END AS hastitlelookup, CASE WHEN charman IS NOT NULL THEN charman ELSE 0::int END AS chairman, CASE WHEN ceo IS NOT NULL THEN ceo ELSE 0::int END AS ceo, CASE WHEN cfo IS NOT NULL THEN cfo ELSE 0::int END AS cfo, CASE WHEN coo IS NOT NULL THEN coo ELSE 0::int END AS coo, CASE WHEN cio IS NOT NULL THEN cio ELSE 0::int END AS cio, CASE WHEN cto IS NOT NULL THEN cto ELSE 0::int END AS cto, CASE WHEN otherclvl IS NOT NULL THEN otherclvl ELSE 0::int END AS otherclvl, CASE WHEN boardmember IS NOT NULL THEN boardmember ELSE 0::int END AS boardmember, CASE WHEN president IS NOT NULL THEN president ELSE 0::int END AS president, CASE WHEN vp IS NOT NULL THEN vp ELSE 0::int END AS vp, CASE WHEN founder IS NOT NULL THEN founder ELSE 0::int END AS founder, CASE WHEN director IS NOT NULL THEN director ELSE 0::int END AS director FROM copeople LEFT JOIN titlelookup ON copeople.jobtitle=titlelookup.fulltitle; --194359 Next we will try to identify whether an executive is male or female based on their names.  DROP TABLE namegender; CREATE TABLE namegender AS SELECT firstname, CASE WHEN countfemale > 0 AND countmale=0 THEN 1::int ELSE 0::int END AS exclusivelyfemale, CASE WHEN countmale > 0 AND countfemale=0 THEN 1::int ELSE 0::int END AS exclusivelymale FROM (SELECT firstname, COALESCE(sum(titlefemale),0) as countfemale, COALESCE(sum(titlemale),0) as countmale FROM copeoplebase WHERE doctor=0 GROUP BY firstname) As T WHERE NOT (countfemale > 0 AND countmale>0); --12736 The next table expands CoPeopleBase to include information on executive gender and executive position.  DROP TABLE CoPeopleFull; CREATE TABLE CoPeopleFull AS SELECT copeoplebase.*, CASE WHEN titlefemale=1 THEN 1::int WHEN exclusivelyfemale=1 THEN 1::int ELSE 0::int END AS female, CASE WHEN titlemale=1 THEN 1::int WHEN exclusivelymale=1 THEN 1::int ELSE 0::int END AS male, CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender, CASE WHEN (ceo=1 OR president=1) THEN 1::int ELSE 0::int END AS ceopres, CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1) THEN 1::int ELSE 0::int END AS CLevel, CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1) THEN 1::int ELSE 0::int END AS board, CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1 OR vp=1 OR founder=1) THEN 1::int ELSE 0::int END AS vpandabove FROM copeoplebase LEFT JOIN namegender ON namegender.firstname=copeoplebase.firstname WHERE hasperson=1; --177547 The next table only keeps executive listings that have a valid portco primary key associated with them.   DROP TABLE CoPeopleKey; CREATE TABLE CoPeopleKey AS
SELECT A.*
FROM SelectedMASALL CoPeopleFull AS A UNIONJOIN (SELECT firstname, lastname, cname, datefirstinv, statecode, count(*) FROM CoPeopleFull WHERE firstname IS NOT NULL AND lastname IS NOT NULL AND cname IS NOT NULL AND datefirstinv IS NOT NULL AND statecode IS NOT NULL GROUP BY firstname, lastname, cname, datefirstinv, statecode HAVING COUNT(*)=1) AS B ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode; --176251 The next table identifies whether a person previously held executive positions.  CREATE TABLE CoPeopleSerial AS SELECT firstname, lastname, cname, datefirstinv, statecode, COALESCE(sum(hasperson),0) as prev, COALESCE(sum(ceo),0) as prevceo, COALESCE(sum(ceopres),0) as prevceopres, COALESCE(sum(founder),0) as prevfounder, COALESCE(sum(clevel),0) as prevclevel, COALESCE(sum(board),0) as prevboard, COALESCE(sum(vpandabove),0) as prevvpandabove, CASE WHEN sum(hasperson) >=1 THEN 1::int ELSE 0::int END AS serial, CASE WHEN sum(ceo) >=1 THEN 1::int ELSE 0::int END AS serialceo, CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS serialceopres, CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS serialfounder, CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS serialclevel, CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS serialboard, CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS serialvpandabove FROM ( SELECT A.prefix, A.firstname, A.lastname, A.cname, A.datefirstinv, A.statecode, B.cname as prevcname, B.datefirstinv as prevdatefirstinv, B.statecode as prevstatecode, B.ceo, B.ceopres, B.founder, B.clevel, B.board, B.vpandabove, B.hasperson FROM CoPeopleKey AS A LEFT JOIN CoPeopleKey AS B ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv > B.datefirstinv ) AS T GROUP BY firstname, lastname, cname, datefirstinv, statecode; --176251 The last table aggregates a ton of information on executives for each company. There is too much information to explain it all.   DROP TABLE copeopleagg; CREATE TABLE copeopleagg AS SELECT A.cname, A.datefirstinv, A.statecode, sum(hasperson) as numperson, sum(hastitle) as numtitled, CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS hasceopres, CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS hasfounder, CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS hasclevel, CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS hasboard, CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS hasvpandabove, sum(female) as females, sum(male) as males, sum(unknowngender) as ugs, sum(doctor*female) as femaledoctors, sum(doctor*male) as maledoctors, sum(doctor*unknowngender) as ugdoctors, sum(ceopres*female) as femaleceos, sum(ceopres*male) as maleceos, sum(ceopres*unknowngender) as ugceos, sum(ceopres*female*doctor) as femaledoctorceos, sum(ceopres*male*doctor) as maledoctorceos, sum(ceopres*unknowngender*doctor) as ugdoctorceos, sum(founder*female) as femalefounders, sum(founder*male) as malefounders, sum(founder*unknowngender) as ugfounders, sum(founder*female*doctor) as femaledoctorfounders, sum(founder*male*doctor) as maledoctorfounders, sum(founder*unknowngender*doctor) as ugdoctorfounders, sum(clevel*female) as femaleclevels, sum(clevel*male) as maleclevels, sum(clevel*unknowngender) as ugclevels, sum(clevel*female*doctor) as femaledoctorclevels, sum(clevel*male*doctor) as maledoctorclevels, sum(clevel*unknowngender*doctor) as ugdoctorclevels, sum(board*female) as femaleboards, sum(board*male) as maleboards, sum(board*unknowngender) as ugboards, sum(board*female*doctor) as femaledoctorboards, sum(board*male*doctor) as maledoctorboards, sum(board*unknowngender*doctor) as ugdoctorboards, sum(vpandabove*female) as femaleabovevps, sum(vpandabove*male) as maleabovevps, sum(vpandabove*unknowngender) as ugabovevps, sum(vpandabove*female*doctor) as femaledoctorabovevps, sum(vpandabove*male*doctor) as maledoctorabovevps, sum(vpandabove*unknowngender*doctor) as ugdoctorabovevps, sum(prev*female) as femaleprevs, sum(prev*male) as maleprevs, sum(prev*unknowngender) as ugprevs, sum(prevceopres*female) as femaleprevceopres, sum(prevceopres*male) as maleprevceopres, sum(prevceopres*unknowngender) as ugprevceopres, sum(prevfounder*female) as femaleprevfounder, sum(prevfounder*male) as maleprevfounder, sum(prevfounder*unknowngender) as ugprevfounder, sum(prevclevel*female) as femaleprevclevel, sum(prevclevel*male) as maleprevclevel, sum(prevclevel*unknowngender) as ugprevclevel, sum(prevboard*female) as femaleprevboard, sum(prevboard*male) as maleprevboard, sum(prevboard*unknowngender) as ugprevboard, sum(prevvpandabove*female) as femaleprevvpandabove, sum(prevvpandabove*male) as maleprevvpandabove, sum(prevvpandabove*unknowngender) as ugprevvpandabove, sum(serial*female) as femaleserials, sum(serial*male) as maleserials, sum(serial*unknowngender) as ugserials, sum(serialceopres*female) as femaleserialceopres, sum(serialceopres*male) as maleserialceopres, sum(serialceopres*unknowngender) as ugserialceopres, sum(serialfounder*female) as femaleserialfounder, sum(serialfounder*male) as maleserialfounder, sum(serialfounder*unknowngender) as ugserialfounder, sum(serialclevel*female) as femaleserialclevel, sum(serialclevel*male) as maleserialclevel, sum(serialclevel*unknowngender) as ugserialclevel, sum(serialboard*female) as femaleserialboard, sum(serialboard*male) as maleserialboard, sum(serialboard*unknowngender) as ugserialboard, sum(serialvpandabove*female) as femaleserialvpandabove, sum(serialvpandabove*male) as maleserialvpandabove, sum(serialvpandabove*unknowngender) as ugserialvpandabove, sum(ceopres*serialceopres*female) as femaleceopresserialceopres, sum(ceopres*serialceopres*male) as maleceopresserialceopres, sum(ceopres*serialceopres*unknowngender) as ugceopresserialceopres, sum(founder*serialfounder*female) as femalefounderserialfounder, sum(founder*serialfounder*male) as malefounderserialfounder, sum(founder*serialfounder*unknowngender) as ugfounderserialfounder FROM SelectedIPOSALL CoPeoplekey AS A JOIN CoPeopleSerial AS B ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode GROUP BY A.cname, A.datefirstinv, A.statecode; --923730413 Since this table is so big, it is a good idea to have a smaller, more manageable table to work with.  DROP TABLE copeopleaggsimple; CREATE TABLE copeopleaggsimple AS SELECT cname, datefirstinv, statecode, numperson, females, males, ugs, ugdoctors, maleserials+femaleserials+ugserials AS serials FROM copeopleagg; --30413 ===Fund People===Luckily, this process is much easier than the company people process. First we must simply load the data into the db.  DROP TABLE fundpeople; CREATE TABLE fundpeople( fundname varchar(255), fundyear int, prefix varchar(5), firstname varchar(50), lastname varchar(50), jobtitle varchar(150), prevpos varchar(255) );  \COPY fundpeople FROM 'Executives-Funds-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV --328994 The next table identifies degree and sex information about the executives of the fund.  DROP TABLE fundpeoplebase; CREATE TABLE fundpeoplebase AS SELECT fundpeople.*, CASE WHEN prefix='Ms' THEN 1::int WHEN prefix='Mr' THEN 0::int ELSE Null::int END AS titlefemale, CASE WHEN prefix='Ms' THEN 0::int WHEN prefix='Mr' THEN 1::int ELSE Null::int END AS titlemale, CASE WHEN prefix='Dr' THEN 1::int ELSE 0::int END AS doctor, CASE WHEN prefix IS NULL THEN 0::int ELSE 1::int END AS hastitle, CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int ELSE 1::int END AS hasperson FROM fundpeople; --328994 The next table tries to identify the sex of the executive using the above defined namegender table. It only selects rows where a person is actually listed.  DROP TABLE FundPeopleFull; CREATE TABLE FundPeopleFull AS SELECT fundpeoplebase.*, exclusivelyfemale, exclusivelymale, CASE WHEN titlefemale=1 THEN 1::int WHEN exclusivelyfemale=1 AND exclusivelymale=0 AND (titlemale=0 OR titlemale IS NULL) THEN 1::int ELSE 0::int END AS female, CASE WHEN titlemale=1 THEN 1::int WHEN exclusivelymale=1 AND exclusivelyfemale=0 AND (titlefemale =0 OR titlefemale IS NULL) THEN 1::int ELSE 0::int END AS male, CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender FROM fundpeoplebase LEFT JOIN namegender ON namegender.firstname=fundpeoplebase.firstname WHERE hasperson=1; --320915 The next table gives you information on executives aggregated by fund.  DROP TABLE FundPeopleAgg; CREATE TABLE FundPeopleAgg AS SELECT fundname, sum(female) as numfemale, sum(male) as nummale, sum(unknowngender) as numunknowngender, sum(doctor) as numdoctor, sum(female*doctor) as numfemaledoctor, sum(male*doctor) as nummaledoctor, sum(unknowngender*doctor) as numunknowngenderdoctor, sum(hastitle) as numtitled, sum(hasperson) as numpeople, CASE WHEN sum(hasperson) > 0 THEN sum(female)/sum(hasperson) ELSE NULL END as fracfemale, CASE WHEN sum(male) > 0 THEN sum(female)/sum(male) ELSE NULL END as ratiofemale FROM FundPeopleFull GROUP BY fundname; --21536 It is also good to have this information on firms. We do not pull firm people information from SDC. However, we have enough information to create it from preexisting tables.  DROP TABLE firmpeopleagg; CREATE TABLE firmpeopleagg AS SELECT _firmname as firmname, sum(numfemale) as firmwomen, sum(nummale) as firmmen, sum(numunknowngender) as firmugs, sum(numdoctor) as firmdoctors, sum(numpeople) as firmpeople, CASE WHEN sum(numpeople) > 0 THEN (sum(numfemale)/sum(numpeople))::real ELSE NULL END as firmfracwomen, CASE WHEN sum(nummale) > 0 THEN (sum(numfemale)/sum(nummale))::real ELSE NULL END as firmratiowomen FROM roundlineaggfunds AS A JOIN fundpeopleagg AS B ON A._fundname=B.fundname GROUP BY _firmname; --5273 ==Marcos's Code==This is code that a Rice student, Marcos Lee, wrote. I cleaned it and ran it. I have described the tables that I built and where they come from below. My code is located in: E:McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\MatchingEntrepsV3 If you have issues understanding my explanation, go to this location and read the query. Most of them are straight forward. ===Describing Stacks Created in Code=== CoPeopleBase: -Builds from copeople and titlelookup -Identifies what roles people played in their companies  namegender: -built from copeoplebase -identifies male/female/unknown  CoPeopleFull: -built from copeoplebase and namegender -builds more extensive information on executive including speficially what level of executive they are  CoPeopleKey: -built from CoPeopleFull -creates table where only executives with full primary keys are kept  CoPeopleSerial: -built from copeoplekey -keeps track of executives previous jobs at executive level  CoPoepleAgg: -built from copeoplekey and copeopleserial -gets extensive information on executives for each company  FundPeopleBae: -built from fundpeople -identifies male/female/doctor -hasperson column slightly weird because we can only have the lastname without prefix or first name and still have a 1 in column. Seems to be of little use/too broad  FundPeopleFull: -built from fundpeoplebase, namegender -adds in male/female   Fundpeopleagg: -built from fundpeoplefull -has aggregations of gender info for each fund  RoundLineJoinerLeanffWlistno: -built from rounlinejoinerleanff -adds listno to funds  RoundLineAggFunds: -built from roundlinejoinerleanffwlistno and rounlineaggfirms -if there are two funds from one firm that invest in same portco, we choose only one and leave the others behind  RoundLineAggWExit: -built from roundlineaggfirms, portcoexitupdated, roundlineaggfunds -adds in exit information for each company in roundlineaggfirms  FirmPerf: -built from roundlineaggwexit -adds in various performance measures for a given firm   PortCoFundDemo: -built from roundlinejoinerleanffclean and fundpeopleagg -gives information on executives of funds who invested in the portcos  PortCoPeopleMaster: -built from PortCoMaster, PortCoIndustry, PortCoPatent, PortCoSBIR, copeoplagg, PortCoFundDemo, CPI, statelookupint -huge amount of data about companies and their executives  RoundAggDistBase: -built from portcogeo, firmbogeo, roundlineaggwexit -creates geographic points using long, lat from geocoding  RoundAggDist: -Built from roundaggdistbase -gets actual distances between portcos and firms. if branch office exists and distance is less than distance to firm chooses that also generates random number  FirmPeopleAgg: -built from roundlineaggfunds, fundpeopleagg -finds information on executives from different firms  PortCoMatchmaster: -built from portcopatent, porcoindustry, portcosbir, copeopleaggsimple, portcoid -gets all information together about portcos  FirmMatchMaster: -built from firmperf, firmvars, firmpeopleagg, firmid -gets all information together about firms  RoundLineMasterBase: -built from portcomatchmaster, firmmatchmaster, roundaggdist, roundlineaggwexit -builds large amount of information about portcos and firms spceifically info about exits and distances  MatchMostNumerous: -built from roundlinemasterbase -finds max number of portcos invested in by a firm that also invested in the company grouping by  MatchHighestRandom: -built from matchmostnumerous -if two firms that invested in one company had the same number of max port cos this randomly chooses one company  FirmActiveYearsCode20: -built from roundlinejoinerleanffclean, porcoindustry -adds firmname to industry code not exactly sure why distinct is used in query  RealMatchesCode20: -built from MatchHighestRandom, PortCoIndustry -real matches between portcos and firms that invested in them including the code20  SyntheticFirmSetBaseCode20: -built from realmatchescode20, firmactiveyarscode20 -crossproduct of firms and portcos. finds firms that invested in same year as portco received first inv, firms invested in same type of company, and makes sure matches are unique  AllMatchKeys: -built from SyntheticFirmSetBaseCode20, RealMatchesCode20 -combines synthetic and real matches  SynthRoundAggDistBaseCode20: -built from allmatchkeys, portcogeo, firmbogeo -builds points for all portco, firm listings in allmatch keys  SynthRoundAddDistCode20: -built from synthroundaggdistvasecode20 -finds actual distance between portcos and firms using installed extensions chooses branch offices if distance between portco and bo less than firm  SynthFirmnameInduBlowoutCode20: -built from allmatchkeys, roundlinemasterbase -gets every firm combination and checks whehter the companies that those firms invested in are in the same general industry  SynthFirmNameroundInduHistCode20: -built from SynthFirmnameInduBlowoutcode20 -gets information by portco, firmname match about what the firms past investment patterns are  MasterWithSynthBaseCode20Portco: -built from Allmatchkeys, matchhighestrandom, synthroundaggdistcode20, sythnfirmnameroundinduhistcode20, synthfirmnameroundindutotalcode20, firmvars, copeopleaggsimple, portcomaster -builds a bunch of information about synthetic and real matches  SynthFirmnameRoundInduTotalCode20: -built from allmatchkeys, roundlinemasterbase -finds number of portcos in certain industries by firmnames  MasterWithSynthCode20Firms: -built with firmmatchmaster, allmatchkeys -matching a bunch of information to all firms  MasterWithSynthcode20: -built from masterwithsynthbasecode20portco, masterwithsynthcode20firms -gets a huge amount of info together on real and synthetic matches about firms and companies  MasterReals: -built from masterwithsynthcode20 -gets just real matches from code
SELECT COUNT(*) FROM SelectedIPOMA WHERE mavsipo=0;MasterOneSynth: -built from masterwithsynthcode20 -7121gets just one randomly chosen synthetic match between companies and firms
SELECT COUNT(*) FROM SelectedIPOMA WHERE mavsipo=1;MasterRealOneSynth: -built from masteronesynth, masterreals -2116combines the real and one synth table
SELECT COUNT(*) FROM SelectedIPOMA WHERE targetname IS NOT NULL;==Ranking Tables and Graphs== --7188 SELECT COUNT(*) FROM SelectedIPOMA WHERE issuername IS NOT NULL; --2141This is a slight detour from the creation of VCDB3. However, this is a cool process because you actually get to use the data you've been working with. This process is extensive, but the queries are easy to understand. If you wish to have deeper understanding of the process, read the code. It is located in:
Everything seems to check out, so we can move on to joining the SelectedIPOMA table to the companybasecore table to create the ExitKeysClean table. DROP TABLE ExitKeysClean; CREATE TABLE ExitKeysClean AS SELECT A.coname, AE:McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\RoundRanking.statecode, A.datefirstinv, B.targetname, B.targetstate, B.announceddate, B.issuername, B.issuerstate, B.issuedate, B.mavsipo FROM companybasecore AS A LEFT JOIN SelectedIPOMA AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv; --48000SQL
First you must create a table that has aggregate round information grouped by cities and round year. Since 48000 this is the same number of rows in the companybasecorea little difficult to picture, we know that I will attach the join was successful and no rows were added that we don't wantcode. DROP TABLE roundleveloutput; CREATE TABLE roundleveloutput AS SELECT city, statecode, roundyear AS year, SUM(rndamtestm*seedflag) AS seedamnt, SUM(rndamtestm*earlyflag) AS earlyamnt, SUM(rndamtestm*laterflag) AS lateramnt, SUM(rndamtestm*growthflag) AS selamnt, SUM(growthflag*dealflag) AS numseldeals FROM round GROUP BY city, statecode, roundyear; --30028
POTENTIAL CAUSE OF ERROR: I am missing 5 entries from Next create a table that lists the SelectedIPOMA WHERE mavsipo=0all time SEL amount by city. The answer Keep including the state code since this will ensure that you have the right city. City names are often repeated in different states. Next, create a table which lists unique city, state for every year since 1980. Then, build a table which matches portcos to the city, state, year blowout table for each year they were alive. This table should be 7121 but instead relatively large since it is 7116lists companies once for every year they were alive up until the present. Then create a table that displays the number of companies alive in a city every year since 1980. Then add in a table that lists all of the information you have built in tables previously based on city, state, year. Also add in population. Not sure why this is happeningThen you can run the ranking queries.
UPDATE: I have found For states follow the 5 missing entires same general process but am unsure as to why thy are group by states not being included in the ExitKeysClean table. coname | statecode | datefirstinv | targetname | targetstate | announceddate | issuername | issuerstate | issuedate | mavsipo | equivalence --------------------+-----------+--------------+-------------------+-------------+---------------+------------+-------------+-----------+---------+------------- Corbel & Company | FL | 1985-04-01 | Corbel & Co Inc | FL | 1993-03-29 | | | | 0 | 0 Deltak Corporation | IL | 1971-08-01 | Deltak Corp | MN | 1993-12-28 | | | | 0 | 0 Wine.com | CA | 1995-07-01 | Winecities and states.com | CA | 2001-04-27 | | | | 0 | 1 CHF Solutions Inc | MN | 1999-06-30 | CHF Solutions Inc | MN | 2010-01-20 | | | | 0 | 1 Packet Design Inc | CA | 2000-06-13 | Packet Design LLC | CA | 2013-03-19 | | | | 0 | 0
==Create If this explanation was not enough for you (it was not meant to be in depth) go to the PortCoExit Table==From consulting with Ed location defined above and read the actual code. With the VC Database Rebuild wikidescription I have given, I decided you should be able to make the PortCoExit table with an mavsipo, an exitdate, an exited, and an exitvalue column. DROP TABLE PortCoExit; CREATE TABLE PortCoExit AS SELECT A.coname, A.statecode, A.datefirstinv, CASE WHEN a.mavsipo=0 THEN 0::int WHEN a.mavsipo=1 THEN 1::int ELSE NULL::int END AS mavsipo, CASE WHEN a.mavsipo=0 THEN B.announceddate WHEN a.mavsipo=1 THEN C.issuedate ELSE NULL::date END AS exitdate, CASE WHEN a.mavsipo=0 OR a.mavsipo=1 THEN 1::int ELSE 0::int END AS exited, CASE WHEN a.mavsipo=0 THEN B.transactionamt WHEN a.mavsipo=1 THEN C.proceedsamt ELSE NULL::money END AS exitvalue FROM ExitKeysClean AS A LEFT JOIN MANoDups AS B ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate LEFT JOIN IPONoDups AS C ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=Cpiece together what each query does.issuedate; --48000
==Master Tables==Throughout the creation of the database, there are inevitably some tables that are vital to create a solid foundation. The issue that I am currently having following tables are the master tables with a quick explanation:* '''Companybasecore'''- The base table for portcos. This is data that the exitvalue column was drawn directly from SDC and was not changed other than for cleaning purposes. Count: 48001* '''BranchOfficeCore'''- The base table for MAs branch offices. This is always Null, and there data drawn directly from SDC. Here only branch offices with distinct firm names are some weird numbers in the included. Count: 10032* '''FirmBaseCore'''- The base tablefor firms. I will continue to look This is also data taken directly from SDC and was not changed other than for thesecleaning purposes. I have been continuously checking Count: 15437* '''FundBaseCore'''- The base table for duplicates funds. This is also data taken directly from SDC and validity was not changed other than for cleaning purposes. Count: 28833* '''IPOCleanNoDups''' - This is the clean table of IPOs after being run through the primary keys, so I do not believe these matcher against portcos. It was cleaned manually and had duplicates removed. Count: 2136* '''IPONoDups'''- This is the table before the cleaning process of matching to portcos. There could be problems with this table as we used an aggregate function here. Be careful using this table. Count: 11149* '''MACleanNoDups'''- This is the clean table of MAs after being run through the matcher against portcos. It was cleaned manually and had duplicates removed. Count: 7171* '''MANoDups'''- This is the table before the cleaning process of matching to portcos. There could be due problems with this table as we used an aggregate function here as well. Be careful using this table. Count: 119374* '''Round'''- This is the master round table. It has SEL flags attached to duplicatesit and has the most round info. RoundBaseClean is also a decent table but has less information. This table is your best bet for round information. Count: 151323* '''RoundLineJoinerLeanFFClean'''- This is the master round table for joining purposes. It was cleaned and used for widespread joining purposes. Count: 163157* '''CoPeople'''- This is the base table for PortCo people information. It was pulled directly from SDC. Count: 194359* '''FirmBoGeo'''- This is the base table for firm/branch office geocoding. This table was cleaned and contains lat/long readings for firms and branch offices where the information was available. Count: 15437* '''PortCoGeo'''- This is the base table for portco geocoding. Table was cleaned and contains lat/long reading for portcos where the Google API returned a valid reading. Count: 48001* '''FirmPerf'''- This is a wide reaching table about the performance of firms. It was mainly used later in the project but is extremely useful. Count: 8336* '''FundPeople'''- This is the base table for fund people information. It was pulled directly from SDC. Count: 328994.* '''PortCoExitUpdated'''- This is the master exit table for portcos. The difference between this and PortCoExit is that Updated has two columns marking MAs and IPOs while the other has one column MAvsIPO. Use which ever one is more convenient. Count: 48001* '''PortCoMaster'''- This table is great. There's a ton of information on PortCos including SEL flags, round amounts, and industry classifications. Count: 48001

Navigation menu