Changes

Jump to navigation Jump to search
no edit summary
{{Project|Has project output=Data,How-to|Has sponsor=McNair ProjectsCenter
|Has title=VentureXpert Data
|Has owner=Augi Liebster,
|Has project status=Active
}}
 
The successors to this project include:
*[[VCDB24]], which is the most recent iteration.
*[[VCDB23]]
*[[VCDB20Q3]]
*[[VCDB20H1]]
*[[VCDB4]]
==Relevant Former Projects==
==Location==
My scripts for SDC pulls are located in the E drive in the location:
E:\McNair\Projects\VentureXpertDatabase\ScriptsForSDCExtract
My successfully pulled and normalized files are stored in the location:
E:\McNair\Projects\VentureXpertDatabase\ExtractedDataQ2
My script scripts for loading tables and data is are in one big text file in the location: E:\McNair\Projects\VentureXpertDatabase\vcdb3\LoadingScripts
There are a variety of SQL files in there with self explanatory names. The file that has all of the loading scripts is called LoadingScriptsV1. The folder vcdb2 is there for reference to see what people before had done. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.
SELECT A.*, B.latitude AS BOLatitude, B.longitude AS BOLongitude FROM firmgeocore AS A LEFT JOIN bogeocore1 AS B ON A.firmname=B.firmname;
--15437
 
==Creating People Tables==
We pulled data on executives in both portcos and funds. I describe the process below. If any of the explanations don't make sense, I also describe most tables in the section called Marcos's Code.
===Company People===
DROP TABLE titlelookup;
CREATE TABLE titlelookup(
fulltitle varchar(150),
charman int,
ceo int,
cfo int,
coo int,
cio int,
cto int,
otherclvl int,
boardmember int,
president int,
vp int,
founder int,
director int
);
 
\COPY titlelookup FROM 'Important Titles in Women2017 dataset.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--628
 
This table lists various titles one can have and identifies where they fall under traditional executive titles.
 
DROP TABLE copeople;
CREATE TABLE copeople(
datefirstinv date,
cname varchar(150),
statecode varchar(2),
prefix varchar(5),
firstname varchar(50),
lastname varchar(50),
jobtitle varchar(150),
nonmanaging varchar(1),
prevpos varchar(255)
);
 
\COPY copeople FROM 'Executives-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--194359
 
This table gets various executives from portcos. This is loaded from SDC. Next we have to identify which traditional executive level job the listed job title corresponds to. It also identifies whether a prefix identifies an executive as male or female. I made a stupid mistake of writing cname instead of coname when loading in the data. If you want to save yourself work, write coname.
 
DROP TABLE copeoplebase;
CREATE TABLE copeoplebase AS
SELECT copeople.*,
CASE WHEN prefix='Ms' THEN 1::int
WHEN prefix='Mr' THEN 0::int
ELSE Null::int END AS titlefemale,
CASE WHEN prefix='Ms' THEN 0::int
WHEN prefix='Mr' THEN 1::int
ELSE Null::int END AS titlemale,
CASE WHEN prefix='Dr' THEN 1::int
ELSE 0::int END AS doctor,
CASE WHEN prefix IS NULL THEN 0::int
ELSE 1::int END AS hastitle,
CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int
ELSE 1::int END AS hasperson,
CASE WHEN fulltitle IS NOT NULL THEN 1::int ELSE 0::int END AS hastitlelookup,
CASE WHEN charman IS NOT NULL THEN charman ELSE 0::int END AS chairman,
CASE WHEN ceo IS NOT NULL THEN ceo ELSE 0::int END AS ceo,
CASE WHEN cfo IS NOT NULL THEN cfo ELSE 0::int END AS cfo,
CASE WHEN coo IS NOT NULL THEN coo ELSE 0::int END AS coo,
CASE WHEN cio IS NOT NULL THEN cio ELSE 0::int END AS cio,
CASE WHEN cto IS NOT NULL THEN cto ELSE 0::int END AS cto,
CASE WHEN otherclvl IS NOT NULL THEN otherclvl ELSE 0::int END AS otherclvl,
CASE WHEN boardmember IS NOT NULL THEN boardmember ELSE 0::int END AS boardmember,
CASE WHEN president IS NOT NULL THEN president ELSE 0::int END AS president,
CASE WHEN vp IS NOT NULL THEN vp ELSE 0::int END AS vp,
CASE WHEN founder IS NOT NULL THEN founder ELSE 0::int END AS founder,
CASE WHEN director IS NOT NULL THEN director ELSE 0::int END AS director
FROM copeople
LEFT JOIN titlelookup ON copeople.jobtitle=titlelookup.fulltitle;
--194359
 
Next we will try to identify whether an executive is male or female based on their names.
 
DROP TABLE namegender;
CREATE TABLE namegender AS
SELECT firstname,
CASE WHEN countfemale > 0 AND countmale=0 THEN 1::int ELSE 0::int END AS exclusivelyfemale,
CASE WHEN countmale > 0 AND countfemale=0 THEN 1::int ELSE 0::int END AS exclusivelymale
FROM
(SELECT firstname, COALESCE(sum(titlefemale),0) as countfemale, COALESCE(sum(titlemale),0) as countmale
FROM copeoplebase WHERE doctor=0
GROUP BY firstname) As T
WHERE NOT (countfemale > 0 AND countmale>0);
--12736
 
The next table expands CoPeopleBase to include information on executive gender and executive position.
 
DROP TABLE CoPeopleFull;
CREATE TABLE CoPeopleFull AS
SELECT copeoplebase.*,
CASE WHEN titlefemale=1 THEN 1::int
WHEN exclusivelyfemale=1 THEN 1::int ELSE 0::int END AS female,
CASE WHEN titlemale=1 THEN 1::int
WHEN exclusivelymale=1 THEN 1::int ELSE 0::int END AS male,
CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender,
CASE WHEN (ceo=1 OR president=1) THEN 1::int ELSE 0::int END AS ceopres,
CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1) THEN 1::int ELSE 0::int END AS CLevel,
CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1) THEN 1::int ELSE 0::int END AS board,
CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1 OR vp=1 OR founder=1) THEN 1::int ELSE
0::int END AS vpandabove
FROM copeoplebase
LEFT JOIN namegender ON namegender.firstname=copeoplebase.firstname
WHERE hasperson=1;
--177547
 
The next table only keeps executive listings that have a valid portco primary key associated with them.
 
DROP TABLE CoPeopleKey;
CREATE TABLE CoPeopleKey AS
SELECT A.*
FROM CoPeopleFull AS A
JOIN (SELECT firstname, lastname, cname, datefirstinv, statecode, count(*) FROM CoPeopleFull
WHERE firstname IS NOT NULL AND lastname IS NOT NULL AND cname IS NOT NULL AND datefirstinv IS NOT NULL AND statecode IS NOT NULL
GROUP BY firstname, lastname, cname, datefirstinv, statecode HAVING COUNT(*)=1) AS B
ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode;
--176251
 
The next table identifies whether a person previously held executive positions.
 
CREATE TABLE CoPeopleSerial AS
SELECT firstname, lastname, cname, datefirstinv, statecode,
COALESCE(sum(hasperson),0) as prev,
COALESCE(sum(ceo),0) as prevceo,
COALESCE(sum(ceopres),0) as prevceopres,
COALESCE(sum(founder),0) as prevfounder,
COALESCE(sum(clevel),0) as prevclevel,
COALESCE(sum(board),0) as prevboard,
COALESCE(sum(vpandabove),0) as prevvpandabove,
CASE WHEN sum(hasperson) >=1 THEN 1::int ELSE 0::int END AS serial,
CASE WHEN sum(ceo) >=1 THEN 1::int ELSE 0::int END AS serialceo,
CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS serialceopres,
CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS serialfounder,
CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS serialclevel,
CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS serialboard,
CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS serialvpandabove
FROM (
SELECT A.prefix, A.firstname, A.lastname, A.cname, A.datefirstinv, A.statecode,
B.cname as prevcname, B.datefirstinv as prevdatefirstinv, B.statecode as prevstatecode, B.ceo, B.ceopres, B.founder, B.clevel, B.board, B.vpandabove, B.hasperson
FROM CoPeopleKey AS A
LEFT JOIN CoPeopleKey AS B ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv > B.datefirstinv
) AS T
GROUP BY firstname, lastname, cname, datefirstinv, statecode;
--176251
 
The last table aggregates a ton of information on executives for each company. There is too much information to explain it all.
 
DROP TABLE copeopleagg;
CREATE TABLE copeopleagg AS
SELECT A.cname, A.datefirstinv, A.statecode,
sum(hasperson) as numperson,
sum(hastitle) as numtitled,
CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS hasceopres,
CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS hasfounder,
CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS hasclevel,
CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS hasboard,
CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS hasvpandabove,
sum(female) as females,
sum(male) as males,
sum(unknowngender) as ugs,
sum(doctor*female) as femaledoctors,
sum(doctor*male) as maledoctors,
sum(doctor*unknowngender) as ugdoctors,
sum(ceopres*female) as femaleceos,
sum(ceopres*male) as maleceos,
sum(ceopres*unknowngender) as ugceos,
sum(ceopres*female*doctor) as femaledoctorceos,
sum(ceopres*male*doctor) as maledoctorceos,
sum(ceopres*unknowngender*doctor) as ugdoctorceos,
sum(founder*female) as femalefounders,
sum(founder*male) as malefounders,
sum(founder*unknowngender) as ugfounders,
sum(founder*female*doctor) as femaledoctorfounders,
sum(founder*male*doctor) as maledoctorfounders,
sum(founder*unknowngender*doctor) as ugdoctorfounders,
sum(clevel*female) as femaleclevels,
sum(clevel*male) as maleclevels,
sum(clevel*unknowngender) as ugclevels,
sum(clevel*female*doctor) as femaledoctorclevels,
sum(clevel*male*doctor) as maledoctorclevels,
sum(clevel*unknowngender*doctor) as ugdoctorclevels,
sum(board*female) as femaleboards,
sum(board*male) as maleboards,
sum(board*unknowngender) as ugboards,
sum(board*female*doctor) as femaledoctorboards,
sum(board*male*doctor) as maledoctorboards,
sum(board*unknowngender*doctor) as ugdoctorboards,
sum(vpandabove*female) as femaleabovevps,
sum(vpandabove*male) as maleabovevps,
sum(vpandabove*unknowngender) as ugabovevps,
sum(vpandabove*female*doctor) as femaledoctorabovevps,
sum(vpandabove*male*doctor) as maledoctorabovevps,
sum(vpandabove*unknowngender*doctor) as ugdoctorabovevps,
sum(prev*female) as femaleprevs,
sum(prev*male) as maleprevs,
sum(prev*unknowngender) as ugprevs,
sum(prevceopres*female) as femaleprevceopres,
sum(prevceopres*male) as maleprevceopres,
sum(prevceopres*unknowngender) as ugprevceopres,
sum(prevfounder*female) as femaleprevfounder,
sum(prevfounder*male) as maleprevfounder,
sum(prevfounder*unknowngender) as ugprevfounder,
sum(prevclevel*female) as femaleprevclevel,
sum(prevclevel*male) as maleprevclevel,
sum(prevclevel*unknowngender) as ugprevclevel,
sum(prevboard*female) as femaleprevboard,
sum(prevboard*male) as maleprevboard,
sum(prevboard*unknowngender) as ugprevboard,
sum(prevvpandabove*female) as femaleprevvpandabove,
sum(prevvpandabove*male) as maleprevvpandabove,
sum(prevvpandabove*unknowngender) as ugprevvpandabove,
sum(serial*female) as femaleserials,
sum(serial*male) as maleserials,
sum(serial*unknowngender) as ugserials,
sum(serialceopres*female) as femaleserialceopres,
sum(serialceopres*male) as maleserialceopres,
sum(serialceopres*unknowngender) as ugserialceopres,
sum(serialfounder*female) as femaleserialfounder,
sum(serialfounder*male) as maleserialfounder,
sum(serialfounder*unknowngender) as ugserialfounder,
sum(serialclevel*female) as femaleserialclevel,
sum(serialclevel*male) as maleserialclevel,
sum(serialclevel*unknowngender) as ugserialclevel,
sum(serialboard*female) as femaleserialboard,
sum(serialboard*male) as maleserialboard,
sum(serialboard*unknowngender) as ugserialboard,
sum(serialvpandabove*female) as femaleserialvpandabove,
sum(serialvpandabove*male) as maleserialvpandabove,
sum(serialvpandabove*unknowngender) as ugserialvpandabove,
sum(ceopres*serialceopres*female) as femaleceopresserialceopres,
sum(ceopres*serialceopres*male) as maleceopresserialceopres,
sum(ceopres*serialceopres*unknowngender) as ugceopresserialceopres,
sum(founder*serialfounder*female) as femalefounderserialfounder,
sum(founder*serialfounder*male) as malefounderserialfounder,
sum(founder*serialfounder*unknowngender) as ugfounderserialfounder
FROM CoPeoplekey AS A
JOIN CoPeopleSerial AS B
ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode
GROUP BY A.cname, A.datefirstinv, A.statecode;
--30413
 
Since this table is so big, it is a good idea to have a smaller, more manageable table to work with.
 
DROP TABLE copeopleaggsimple;
CREATE TABLE copeopleaggsimple AS
SELECT cname, datefirstinv, statecode, numperson, females, males, ugs, ugdoctors, maleserials+femaleserials+ugserials AS serials
FROM copeopleagg;
--30413
 
===Fund People===
Luckily, this process is much easier than the company people process. First we must simply load the data into the db.
 
DROP TABLE fundpeople;
CREATE TABLE fundpeople(
fundname varchar(255),
fundyear int,
prefix varchar(5),
firstname varchar(50),
lastname varchar(50),
jobtitle varchar(150),
prevpos varchar(255)
);
 
\COPY fundpeople FROM 'Executives-Funds-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--328994
 
The next table identifies degree and sex information about the executives of the fund.
 
DROP TABLE fundpeoplebase;
CREATE TABLE fundpeoplebase AS
SELECT fundpeople.*,
CASE WHEN prefix='Ms' THEN 1::int
WHEN prefix='Mr' THEN 0::int
ELSE Null::int END AS titlefemale,
CASE WHEN prefix='Ms' THEN 0::int
WHEN prefix='Mr' THEN 1::int
ELSE Null::int END AS titlemale,
CASE WHEN prefix='Dr' THEN 1::int
ELSE 0::int END AS doctor,
CASE WHEN prefix IS NULL THEN 0::int
ELSE 1::int END AS hastitle,
CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int
ELSE 1::int END AS hasperson
FROM fundpeople;
--328994
 
The next table tries to identify the sex of the executive using the above defined namegender table. It only selects rows where a person is actually listed.
 
DROP TABLE FundPeopleFull;
CREATE TABLE FundPeopleFull AS
SELECT fundpeoplebase.*, exclusivelyfemale, exclusivelymale,
CASE WHEN titlefemale=1 THEN 1::int
WHEN exclusivelyfemale=1 AND exclusivelymale=0 AND (titlemale=0 OR titlemale IS NULL) THEN 1::int ELSE 0::int END AS female,
CASE WHEN titlemale=1 THEN 1::int
WHEN exclusivelymale=1 AND exclusivelyfemale=0 AND (titlefemale =0 OR titlefemale IS NULL) THEN 1::int ELSE 0::int END AS male,
CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender
FROM fundpeoplebase
LEFT JOIN namegender ON namegender.firstname=fundpeoplebase.firstname
WHERE hasperson=1;
--320915
 
The next table gives you information on executives aggregated by fund.
 
DROP TABLE FundPeopleAgg;
CREATE TABLE FundPeopleAgg AS
SELECT fundname,
sum(female) as numfemale,
sum(male) as nummale,
sum(unknowngender) as numunknowngender,
sum(doctor) as numdoctor,
sum(female*doctor) as numfemaledoctor,
sum(male*doctor) as nummaledoctor,
sum(unknowngender*doctor) as numunknowngenderdoctor,
sum(hastitle) as numtitled,
sum(hasperson) as numpeople,
CASE WHEN sum(hasperson) > 0 THEN sum(female)/sum(hasperson) ELSE NULL END as fracfemale,
CASE WHEN sum(male) > 0 THEN sum(female)/sum(male) ELSE NULL END as ratiofemale
FROM FundPeopleFull
GROUP BY fundname;
--21536
 
It is also good to have this information on firms. We do not pull firm people information from SDC. However, we have enough information to create it from preexisting tables.
 
DROP TABLE firmpeopleagg;
CREATE TABLE firmpeopleagg AS
SELECT _firmname as firmname, sum(numfemale) as firmwomen, sum(nummale) as firmmen, sum(numunknowngender) as firmugs,
sum(numdoctor) as firmdoctors, sum(numpeople) as firmpeople,
CASE WHEN sum(numpeople) > 0 THEN (sum(numfemale)/sum(numpeople))::real ELSE NULL END as firmfracwomen,
CASE WHEN sum(nummale) > 0 THEN (sum(numfemale)/sum(nummale))::real ELSE NULL END as firmratiowomen
FROM roundlineaggfunds AS A
JOIN fundpeopleagg AS B ON A._fundname=B.fundname
GROUP BY _firmname;
--5273
==Marcos's Code==
This is code that a Rice student, Marcos Lee, wrote. I cleaned it and ran it. I have described the tables that I built and where they come from below. My code is located in:
E:McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\MatchingEntrepsV3
 
If you have issues understanding my explanation, go to this location and read the query. Most of them are straight forward.
===Describing Stacks Created in Code===
CoPeopleBase:
-built from masteronesynth, masterreals
-combines the real and one synth table
 
==Ranking Tables and Graphs==
This is a slight detour from the creation of VCDB3. However, this is a cool process because you actually get to use the data you've been working with. This process is extensive, but the queries are easy to understand. If you wish to have deeper understanding of the process, read the code. It is located in:
 
E:McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\RoundRanking.SQL
 
First you must create a table that has aggregate round information grouped by cities and round year. Since this is a little difficult to picture, I will attach the code.
DROP TABLE roundleveloutput;
CREATE TABLE roundleveloutput AS SELECT
city, statecode, roundyear AS year,
SUM(rndamtestm*seedflag) AS seedamnt,
SUM(rndamtestm*earlyflag) AS earlyamnt,
SUM(rndamtestm*laterflag) AS lateramnt,
SUM(rndamtestm*growthflag) AS selamnt,
SUM(growthflag*dealflag) AS numseldeals
FROM round GROUP BY city, statecode, roundyear;
--30028
 
Next create a table that lists the all time SEL amount by city. Keep including the state code since this will ensure that you have the right city. City names are often repeated in different states. Next, create a table which lists unique city, state for every year since 1980. Then, build a table which matches portcos to the city, state, year blowout table for each year they were alive. This table should be relatively large since it lists companies once for every year they were alive up until the present. Then create a table that displays the number of companies alive in a city every year since 1980. Then add in a table that lists all of the information you have built in tables previously based on city, state, year. Also add in population. Then you can run the ranking queries.
 
For states follow the same general process but group by states not cities and states.
 
If this explanation was not enough for you (it was not meant to be in depth) go to the location defined above and read the actual code. With the description I have given, you should be able to piece together what each query does.
==Master Tables==

Navigation menu