Changes

Jump to navigation Jump to search
no edit summary
}}
==Current WorkDataset Rebuild== In E:\projects\vcdb20\BuildBaseTables.sql === V2 === Objectives: * Generate matchinstagenarrow and matchinstagebroad* Check coverage of Xunjie's variables* Generate some alternative/additional variables: Xunjie's list of variables for estimation:* matchhqdist -- matchbodist is preferred. 152/500k nulls.* matchinstagenarrow -- Now in the dataset with improved logic. Probably use matchinstagebroad instead. No nulls.* firmfirstinvyear -- firmageatdeal is preferred. No Nulls.* matchprevindu20 -- no nulls.* matchprevportcos -- no nulls.* pcnumperson -- this is a conceptually and operationally terrible variable! See below.* pccitydollarsrankm1 -- matching on placename has issues!* pcexp -- similar issues to pcnumperson. See below. Restrictions:* I only keep data between 2002 and 2016. * And I only keep the matching markets if the number of real matches is more than or equal to 5. * This gives me 922 matching markets.   "The major issue is that there are many missing values (about 10%) for the last three variables (and possibly “matchinstagenarrow"). We need to drop the entire matching market if there is one missing value. The number of matching markets without missing values is 260. So less than 1/3 matching markets survive." Dropping the entire market is surely way too extreme. We should just drop the offending portco and only drop the market if the number of real matches drops below our threshold (e.g., 5). === pcnumperson / pcexp === pcnumperson suffers from a number of endogeneity issues, including:# Thomson adds information each time the firm recieves more investment, so pcnumperson is correlated with the number of rounds, amount invested, prob of exit, etc.# Higher quality firms/portcos are more likely to report the people in a portco. # numperson includes a broad range of titles, roughly VP-level and above with some extras, and more organized portcos may report deeper into their ranks. pcexp suffers from all of the above issues and more. pcexp has the following lineage:* portcosuper.pcexp <- prevs.PortCoPeopleMaster <- sum(prev).PortCoPeopleAgg <- COALESCE(sum(hasperson),0).PortCoPeopleSerial <-hasperson.PortCoPeopleKey <- hasperson.PortCoPeopleFull <- hasperson.PortCoPeopleBase* PortCoPeopleBase: CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int ELSE 1::int END AS hasperson,* So hasperson is a record indicator, and the pcexp sums over both people and thier positions. Accordingly:** A portco with 1 person who has held two previous positions has pcexp=2** A portco with 2 people who have each held one previous position has pcexp=2   ==== match in stage ==== New logic makes 'matchinstagepref' (both narrow and broad) true when stagepref is null, the deal is sead/early and the firm is 'Ecosystem','SBIC','Angel','Gov', and otherwise when stagepref is null.  CASE WHEN (dealseed >= 1 AND (firmstageprefno=1 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (dealearly >= 1 AND (firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND (dealseed >= 1 OR dealearly >= 1) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND NOT (dealseed >= 1 OR dealearly >= 1) THEN 0::int WHEN firmstageprefno IS NULL THEN 1::int ELSE 0::int END AS matchinstagenarrow, CASE WHEN (deallater >= 1 AND (firmstageprefno=1 OR firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (dealearly >= 1 AND (firmstageprefno=1 OR firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND (dealseed >= 1 OR dealearly >= 1) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND NOT (dealseed >= 1 OR dealearly >= 1) THEN 0::int WHEN firmstageprefno IS NULL THEN 1::int ELSE 0::int END AS matchinstagebroad, === V1 === *Changed MatchHighestRandom to MatchMaster. It is MatchMostNumerous (i.e., pick the firm with max(numportcos) for each portco from RLMaster) with a random tie break. It contains a lot of variables pertaining to the portco, firm, round, and match!*MatchKeys is coname, statecode, datefirstinv, firmname, as well as minroundin, year, code, code20, code100. It replaces RealMatchesCode. Code is in E:\projects\unobservedcomplementarities\BuildDataset.sql*Replaced SynRealSetc20 with SynthKeys_Code20.*Replaced AllRealMatchKeysC20Code with ComboKeys_Code20, also renamed realmatch variable to isreal.**Note that ComboKeys_Code20 (591,299 with 554,561 synths) is much smaller AllRealMatchKeysC20Code (1,631,896 with 1,599,427 synths), which allowed (almost) any other firm from RLJoinerFF that had done a deal in that code20-year. ComboKeys_Code20 only allows another real match from the same code20-year.*Rebuilt Super tables (PortCoSuper, DealSuper, FirmSuper). Note that FirmSuper is now restricted to US firms only (matches were already US-US because MatchMostNumerous was constrained to US (and state !='UN') firms and portcos only. However, it is not clear that this was true in the past.).*Rebuilt the AllMatchc20 tables. New names are Combo..._Code20. **Crucial difference: Only investments within ComboKeys_Code20 are included in the history counts. Before anything in RLJoinerFF or even RoundlineBase were included, which created the impression of overcounts.**The vast majority of missing distances were caused by missing firm addresses for just 391 VCs (some of these were state UN, and later removed). However, we have zip codes for almost all of them. I built firmbogeoplus (see [[VCDB20]]) to add in zcta centroids from the U.S. Census gazetteer where available. However, 47 zips weren't in the ZCTA lookup, and they account for 113 firms that participated in 3,997 pairs (real and synth). So I also ran GeocodeOneKey.py with a (no header) Zip\tZip input (MissingFirmZips.txt), manually added three records (02801,85292,91399) and loaded up the result as zipgeoaddon (from MissingFirmZips-Geocoded.txt) in Load.sql.**Also required that matches have a code(/20/100)**Changed the Experience table to count experience in previous years rather than previous rounddates, as then experience doesn't change within a market except through the match.*Changed the join to the ranking. New build uses PlaceYearRanking, and joins on placename, state, year (placename is available in PortCoSuper).**There was an error with Centerville Ohio appearing in the ranking twice each year from 1998 to 2001.  SELECT hasgeo, COUNT(*) FROM ComboDist_Code20 WHERE bodist IS NULL GROUP BY hasgeo; /* firm 52 none 4 pc 10444 */ Note: The following are available in firmbasecore but unused in firmvars: SELECT invstatus, rolepref, geogpref, indpref, stagepref FROM firmbasecore; SELECT invstatus, count(*) FROM firmbasecore group by invstatus ORDER BY invstatus; Actively seeking new investments 10856 Inactive / Unknown 1625 Making few if any new investments 68 Reducing investment activity 48 1257 SELECT rolepref, count(*) FROM firmbasecore group by rolepref ORDER BY rolepref; BO 965 DO 340 EI 757 OH 10 OT 131 11651 SELECT COUNT(*) FROM ( SELECT geogpref, count(*) FROM firmbasecore group by geogpref ORDER BY geogpref) AS T; --181 --Includes, states, nations, continents, etc. SELECT COUNT(*) FROM ( SELECT indpref, count(*) FROM firmbasecore group by indpref ORDER BY indpref) AS T; --112 --COULD MAYBE BE CODED IN TERMS OF CODE20, etc. ===Review of old dataset=== Anticipating a dataset rebuild, the old production dataset was '''MasterRealC20YearFullPlus.txt''', which was produced using: E:\projects\vcdb3\MatchingEntreps2VCs\MatchingVCEntrepRevisions.sql This seems to have an error in the construction of '''matchinstagebroad''' l129.
Xunkie MasterRealC20YearFull requires:*AllRealMatchKeysC20Code <-RealMatchescode, AllRealMatchKeysC20; AllRealMatchKeysC20 <-SynRealSetc20,RealMatchescode; SynRealSetc20 <-RealMatchescode; RealMatchesCode <-MatchHighestRandom, PortCoIindustry (RevisedDbaseCode.sql)*PortCoSuper <- likely new PortCoMaster*DealSuper: Would need to be rebuilt*FirmSuper <- likely new FirmVars and FirmGrowthPerf*AllMatchc20Dist <-AllMatchc20DistBase (RevisedDbaseCode.sql); AllMatchc20DistBase <- AllMatchKeysC20, portcogeo, firmbogeo; AllMatchKeysC20 <-RealMatchesCode, SynSetc20.*AllMatchc20InduHistWZero <- AllMatchKeysC20Code, AllMatchc20InduHist ... RoundLineMasterBase (RevisedDbaseCode.sql)*AllMatchc20InduTotal <- AllMatchKeysC20Code, RoundLineMasterBase (RevisedDbaseCode.sql) *rankingfull <- likely new placeyearranking Code for matchinstagenarrow CASE WHEN (dealseed >= 1 AND (firmstageprefno=1 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (dealearly >= 1 AND (firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int ELSE 0::int END AS matchinstagenarrow, ==Xunjie's Notes== Xunjie Zheng wrote "Notes on Estimation of Venture Model", which is in the project folder.
Note on that doc:
*We may want to redefine PC Rank so that the higher its value the higher the rank of the city
Ed's NotesPotential news vars:
*FracOfDealsInSector
*Patent Measure? Imprecisely estimated...
*Single node - 80 Cores
  ===GeneralDiscussion Points===
Discuss normalization:
*VC Exp is “matchprevindu20” in the codebook, which is defined as “Matched VCs prior experience investing in companies (i.e., count of companies) in the same pccode20 as the PortCo”. There might be a mistake in constructing this variable. The VC Exp for counterfactual matches are greater than or equal to the value for realized matches.
===Tablesin the paper===
Tables:
===Other Notes on Xunjie's Work===
The estimation strategy is a two-step simulated method of moments. The solver is "adaptive_de_rand_1 bin_radius_limited" in the
Julia package "BlackBoxOptim.jl", which executes Differential Evolution algorithm.
Julia code:
*https://github.com/robertfeldt/BlackBoxOptim.jl (support parallel CPU)
 
==Dataset Rebuild==
 
In E:\projects\vcdb20\BuildBaseTables.sql
 
*Changed MatchHighestRandom to MatchMaster. It is MatchMostNumerous (i.e., pick the firm with max(numportcos) for each portco from RLMaster) with a random tie break. It contains a lot of variables pertaining to the portco, firm, round, and match!
*MatchKeys is coname, statecode, datefirstinv, firmname, as well as minroundin, year, code, code20, code100. It replaces RealMatchesCode.
 
Code is in E:\projects\unobservedcomplementarities\BuildDataset.sql
*Replaced SynRealSetc20 with SynthKeys_Code20.
*Replaced AllRealMatchKeysC20Code with ComboKeys_Code20, also renamed realmatch variable to isreal.
**Note that ComboKeys_Code20 (591,299 with 554,561 synths) is much smaller AllRealMatchKeysC20Code (1,631,896 with 1,599,427 synths), which allowed (almost) any other firm from RLJoinerFF that had done a deal in that code20-year. ComboKeys_Code20 only allows another real match from the same code20-year.
*Rebuilt Super tables (PortCoSuper, DealSuper, FirmSuper). Note that FirmSuper is now restricted to US firms only (matches were already US-US because MatchMostNumerous was constrained to US (and state !='UN') firms and portcos only. However, it is not clear that this was true in the past.).
*Rebuilt the AllMatchc20 tables. New names are Combo..._Code20.
**Crucial difference: Only investments within ComboKeys_Code20 are included in the history counts. Before anything in RLJoinerFF or even RoundlineBase were included, which created the impression of overcounts.
**The vast majority of missing distances were caused by missing firm addresses for just 391 VCs (some of these were state UN, and later removed). However, we have zip codes for almost all of them. I built firmbogeoplus (see [[VCDB20]]) to add in zcta centroids from the U.S. Census gazetteer where available. However, 47 zips weren't in the ZCTA lookup, and they account for 113 firms that participated in 3,997 pairs (real and synth). So I also ran GeocodeOneKey.py with a (no header) Zip\tZip input (MissingFirmZips.txt), manually added three records (02801,85292,91399) and loaded up the result as zipgeoaddon (from MissingFirmZips-Geocoded.txt) in Load.sql.
**Also required that matches have a code(/20/100)
**Changed the Experience table to count experience in previous years rather than previous rounddates, as then experience doesn't change within a market except through the match.
*Changed the join to the ranking. New build uses PlaceYearRanking, and joins on placename, state, year (placename is available in PortCoSuper).
**There was an error with Centerville Ohio appearing in the ranking twice each year from 1998 to 2001.
 
 
 
SELECT hasgeo, COUNT(*) FROM ComboDist_Code20 WHERE bodist IS NULL GROUP BY hasgeo;
/*
firm 52
none 4
pc 10444
*/
 
Note: The following are available in firmbasecore but unused in firmvars:
SELECT invstatus, rolepref, geogpref, indpref, stagepref FROM firmbasecore;
SELECT invstatus, count(*) FROM firmbasecore group by invstatus ORDER BY invstatus;
Actively seeking new investments 10856
Inactive / Unknown 1625
Making few if any new investments 68
Reducing investment activity 48
1257
SELECT rolepref, count(*) FROM firmbasecore group by rolepref ORDER BY rolepref;
BO 965
DO 340
EI 757
OH 10
OT 131
11651
SELECT COUNT(*) FROM (
SELECT geogpref, count(*) FROM firmbasecore group by geogpref ORDER BY geogpref) AS T;
--181
--Includes, states, nations, continents, etc.
SELECT COUNT(*) FROM (
SELECT indpref, count(*) FROM firmbasecore group by indpref ORDER BY indpref) AS T;
--112
--COULD MAYBE BE CODED IN TERMS OF CODE20, etc.
 
===Review===
 
Anticipating a dataset rebuild, the old production dataset was '''MasterRealC20YearFullPlus.txt''', which was produced using:
E:\projects\vcdb3\MatchingEntreps2VCs\MatchingVCEntrepRevisions.sql
 
This seems to have an error in the construction of '''matchinstagebroad''' l129.
 
MasterRealC20YearFull requires:
*AllRealMatchKeysC20Code <-RealMatchescode, AllRealMatchKeysC20; AllRealMatchKeysC20 <-SynRealSetc20,RealMatchescode; SynRealSetc20 <-RealMatchescode; RealMatchesCode <-MatchHighestRandom, PortCoIindustry (RevisedDbaseCode.sql)
*PortCoSuper <- likely new PortCoMaster
*DealSuper: Would need to be rebuilt
*FirmSuper <- likely new FirmVars and FirmGrowthPerf
*AllMatchc20Dist <-AllMatchc20DistBase (RevisedDbaseCode.sql); AllMatchc20DistBase <- AllMatchKeysC20, portcogeo, firmbogeo; AllMatchKeysC20 <-RealMatchesCode, SynSetc20.
*AllMatchc20InduHistWZero <- AllMatchKeysC20Code, AllMatchc20InduHist ... RoundLineMasterBase (RevisedDbaseCode.sql)
*AllMatchc20InduTotal <- AllMatchKeysC20Code, RoundLineMasterBase (RevisedDbaseCode.sql)
*rankingfull <- likely new placeyearranking
 
Code for matchinstagenarrow
CASE WHEN (dealseed >= 1 AND (firmstageprefno=1 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int
WHEN (dealearly >= 1 AND (firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int
WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int
ELSE 0::int END AS matchinstagenarrow,
==Previous Work==

Navigation menu