|Has paper status=In development
}}
 
==Counterfactual Variables==
 
Version 2-2 of the dataset includes the following sets of variables:
# firmappmoomirank, firmexitsrank, firmappexitmrank, firmappinvmrank, firmportcosrank
# firmappmoomiwrank, firmappmoomitruncwrank, firmappexitmwrank
# firmappmoomirankqtile, firmexitsrankqtile, firmappexitmrankrankqtile, firmappinvmrankqtile, firmportcosrankrankqtile
# firmappmoomiwrankrankqtile, firmappmoomitruncwrankqtile, firmappexitmwrankrankqtile
 
The first set ranks the performance measures already in the dataset using only the 3542 VC firms in the dataset: apportioned MOOMI (firmapportionedmoomi), exits (firmexits), apportioned exit value (firmapportionedexitvaluem), apportioned investment (firmtotalapportionedinvm) and the total number of portcos invested in (firmnumportcos).
 
The second set calculates weighted ranks or truncated weighted ranks (where 1 is the best and 3542 is the worst):
* firmappmoomiwrank ranks firmappmoomi*(firmportcos/sumfirmportcos)
* firmappmoomitruncwrank ranks firmappmoomi*((CASE WHEN firmportcos <= 75 THEN firmportcos ELSE 75 END)/sumfirmportcostrunc)
* firmappexitmwrank ranks firmappexitm*(firmportcos/sumfirmportcos)
 
The third and fourth sets are the ranks above as quartiles (i.e., 1 is best, 4 is worst).
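
A minimal sketch of how the weighted ranks and their quartiles could be computed. It assumes that sumfirmportcos is the sum of firmportcos over all firms in the dataset (the text does not define it), that ties are broken arbitrarily, and that quartiles are equal-sized bins of the rank. The function names and toy data are hypothetical; the actual build is done in SQL.

```python
def weighted_ranks(firms, cap=None):
    """Rank firms by firmappmoomi * (firmportcos / sum of firmportcos); 1 is best.

    firms: dict mapping firm name -> (firmappmoomi, firmportcos).
    cap:   optional truncation of firmportcos (75 in the truncated variant).
    """
    def size(n):
        return n if cap is None else min(n, cap)

    total = sum(size(n) for _, n in firms.values())
    scores = {name: moomi * size(n) / total for name, (moomi, n) in firms.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)  # highest score first
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def quartile(rank, n_firms):
    """Map a rank in 1..n_firms to a quartile, 1 (best) to 4 (worst)."""
    return (4 * (rank - 1)) // n_firms + 1


# Toy data (hypothetical): name -> (firmappmoomi, firmportcos)
firms = {"A": (3.0, 200), "B": (6.0, 40), "C": (1.0, 10), "D": (2.0, 80)}
plain = weighted_ranks(firms)          # large firm A ranks first
trunc = weighted_ranks(firms, cap=75)  # small but successful firm B overtakes A
```

In the toy data, truncation moves the small firm B ahead of the large firm A, which is the size-penalty effect that firmappmoomitruncwrank is meant to soften.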
 
By all means, play with the variables! My guess as to the "best" (i.e., closest approximation to true quality as measured by long-run returns) is (in order):
* '''firmappmoomiwrank''': This puts all the most famous VCs in the top quartile and the bottom quartile truly sucks.
* firmappmoomitruncwrank (harder to justify the truncation, so second place but gives higher rankings to successful small firms who face a pretty stiff size penalty in firmappmoomiwrank)
* firmappexitmwrank: Doesn't take into account investment but still delivers a good result
 
I also took a look at which types of firms to remove. It turns out that '''firmcat''' was already pretty well put together (i.e., I'd gone down that rabbit hole and excavated it). So, I suggest that we try the following components separately and then, depending on the results, consider the following groupings:
* Corporate
* PE
* Ecosystem + Angel
* Gov + SBIC
 
==Notes==
 
This section provides notes on the analysis in Selected_prelim.do. I ran the code and did a basic exploration of variables. These are the results:
* '''I couldn't make the sign on distance change.''' That's a real finding! I tried CA/MA fixed effects and interactions, Silicon Valley/Boston Cambridge fixed effects and interactions, using quadratic effects, various transformations... nothing undid the results. Also, log distance is the best variable.
* matchinstagebroad outperforms matchinstagenarrow in some contexts, so we should use it.
* I tried logging all the vars, even t_pccitydollarsrankm1, to no avail.
* pcexitvaluem is already set to 0 when exit=0. It is missing for undisclosed value acquisitions. '''It's a great RHS variable: Log it and use it!'''
* I used mktid fixed effects and the results look pretty good without the interactions... but less good with them for the probit. There was a slight improvement for the reg, but only 2 vars are sig either way. I think maybe year ind is fine.
* Doing value conditional on exit gave just 1 *** and 1 *. It's the exit that drives the sig in exit value.
 
So, all in all, I say you're good to go with your 'best spec' if you want to keep the interactions:
probit pcexit l_matchbodist matchinstagebroad c.l_firmageatdeal##c.l_pcexpceopres c.l_matchprevportcos##c.t_pccitydollarsrankm1 i.year i.ind if realmatch==1, cluster(mktid)
 
However, '''the interactions are never significant'''. If you drop the weakest of the interactions (the one involving l_matchprevportcos*t_pccitydollarsrankm1), you're good to go for every variable except the remaining interaction, which is (almost) borderline (it's significant if you drop the industry fixed effects):
probit pcexit l_matchbodist matchinstagebroad c.l_firmageatdeal##c.l_pcexpceopres matchprevportcos t_pccitydollarsrankm1 i.year i.ind if realmatch==1, cluster(mktid)
 
Without interactions, the variables are all good:
probit pcexit l_matchbodist matchinstagebroad l_firmageatdeal l_pcexpceopres matchprevportcos t_pccitydollarsrankm1 i.mktid if realmatch==1, cluster(mktid)
 
A word on the interpretation of variables for the write-up:
* l_firmageatdeal is our VC wisdom variable
* matchprevportcos is our VC size variable
* l_pcexpceopres is our PC team quality variable
* t_pccitydollarsrankm1 is our PC environment quality variable
==Dataset Rebuild==
This project uses [[VCDB20]]. In E:\projects\vcdb20\:
* Load.sql
* BuildBaseTables.sql
* Ranking.sql
Specific to this project:
* BuildDataset.sql
=== V2 ===
 
==== New dataset ====
 
The new file is: MasterCode20YearV2-1.txt. It's in the dropbox!
 
===== V2-1 Changes =====
 
There were some issues with bodistkm being rather extreme (i.e., ~6000-8000km):
* A single portco, CodeHS, had an incorrectly geocoded address. Despite being listed as San Francisco, CA, addr2 was "Babraham Research Campus", which Google Maps was incorrectly associating with the village of Babraham in Cambridge, UK. The address for this firm has been manually fixed (the correct address is 1328 Mission St, San Francisco, CA 94103) and its correct geocoding has been pushed through the tables (portcogeo, portcopoints, portcomaster, portcosuper) to the final dataset. CodeHS received its first round in 2012 and may achieve an exit but probably .
* The remaining extreme values were caused by portcos or firms being located outside of the continental U.S. (i.e., in HI, AK, or PR). When such a firm was paired with a mainland firm, it would have an extreme distance. '''The dataset is now restricted to the continental U.S. The largest distances are now ME-CA pairs or WA-FL pairs, as expected.'''
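
To illustrate why a non-continental location produces an extreme bodistkm, here is a minimal sketch using the haversine great-circle formula. It is an assumption that the dataset's distances are great-circle distances; the coordinates are approximate and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
NON_CONTINENTAL = {"HI", "AK", "PR"}  # state codes excluded in V2-1


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_continental(statecode):
    """True unless the state code is one of the excluded non-continental codes."""
    return statecode not in NON_CONTINENTAL


# Approximate coordinates, for illustration only
san_francisco = (37.77, -122.42)
boston = (42.36, -71.06)
honolulu = (21.31, -157.86)

sf_to_boston = haversine_km(*san_francisco, *boston)  # a coast-to-coast mainland pair
boston_to_hnl = haversine_km(*boston, *honolulu)      # a pair involving HI
```

A coast-to-coast mainland pair stays well below the distance of any pair that involves Hawaii, which is why the continental restriction removes the extreme values.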
 
==== Summary of request ====
Objectives:
Xunjie's list of variables for estimation:
* matchhqdist -- matchbodist is preferred. was 152/500k nulls, should be done now.
* matchinstagenarrow -- Now in the dataset with improved logic. Probably use matchinstagebroad instead. No nulls.
* firmfirstinvyear -- firmageatdeal is preferred. No Nulls.
* matchprevportcos -- no nulls.
* pcnumperson -- this is a conceptually and operationally terrible variable! See below.
* pccitydollarsrankm1 -- matching on placename had issues and lots of missing values! This should be resolved now.
* pcexp -- similar issues to pcnumperson. See below.
Xunjie's Restrictions:
* "I only keep data between 2002 and 2016.
* And I only keep the matching markets if the number of real matches is more than or equal to 5.
So less than 1/3 matching markets survive."
Dropping the entire market is surely way too extreme. We should just drop the offending portco and only drop the market if the number of real matches drops below our threshold (e.g., 5). I've included some new market stats to give analytics: mktdealcount, mktnumreal, mktnumsyn, mktnumfirms, mktvalid.

==== Review of Changes ====

In BuildBaseTables.sql:
* Fixed PortCoGeoid to use zipcodes too to determine placename
* Created separate geoid lookup table for place, statecode: PlaceStatecodeGeoid
* Pushed changes through PortCoMaster
* Added new vars to PortCoPeople and pushed them through.

Added restrictions to MatchMostNumerous (34604) by creating a temp view RLMasterRestricted where:
* Firm and portco nation code='US'
* Firm and portco statecode code!='UN'
* Code is not null
* placename is not null
* hqdistkm is not null

In Ranking.sql:
* Re-run with updated portcogeoid!
* Pick up Geoid from new PlaceStatecodeGeoid table (BuildBaseTables.sql)

In BuildDataset.sql:
* SynthKeys_Code20: Pushed changes through. 488065
* ComboKeys_Code20: 522669 = 488065 + 34604
* PortCoSuper (not restricted) - uses latest version of PortCoMaster. Placename from PortCoGeoid, which uses zipcodes too. DealSuper (restricted to MatchKeys!), FirmSuper (now US only, but not restricted)
* Combodist, ComboIndu, and ComboMeasures tables (all based on ComboKeys_Code20) much as before but with new base sets
* ComboStats_Code20 added to provide market info.
* MasterCode20Year. Rerun with new feeders. New variables added.

==== pccitydollarsrankm1 ====

There are a number of possible explanations for why this variable had lots of missing values. There do seem to be missing placenames. 4855/69882 PortCoSuper records don't join to PlaceYearRanking on placename and state (ignoring year) and 4,561 of these have valid zips. However, only 263 had growth VC and just 82 have non-null positive invested amounts, so this isn't the issue.

Ultimately, I rebuilt the underlying tables (portgeoid, etc.) and created a new lookup table (PlaceStatecodeGeoid), and then reran the rankings, making sure to keep the "no activity" places for each year (tied for last place). The ranking variables should be fixed now.

==== pcnumperson / pcexp ====
pcnumperson suffers from a number of endogeneity issues, including:
# Thomson adds information each time the firm receives more investment, so pcnumperson is correlated with the number of rounds, amount invested, prob of exit, etc.
# Higher quality firms/portcos are more likely to report the people in a portco.
# numperson includes a broad range of titles, roughly VP-level and above with some extras, and more organized portcos may report deeper into their ranks.
# It's possible that some non-exec board members are included
pcexp suffers from all of the above issues and more. pcexp has the following lineage:
** A portco with 1 person who has held two previous positions has pcexp=2
** A portco with 2 people who have each held one previous position has pcexp=2
* Non-exec board members (lawyers, investors, etc.) may have worked with lots of previous firms and be inflating this count!
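
The two cases above can be reproduced with a minimal sketch. It assumes that pcexp is the sum of previous positions over everyone listed for the portco, which is inferred from the two examples rather than taken from the SQL build. The function name is hypothetical.

```python
def pcexp(prev_positions_per_person):
    """Sum previous positions over all people listed for a portco (assumed definition)."""
    return sum(prev_positions_per_person)


one_person_two_jobs = pcexp([2])    # 1 person who has held 2 previous positions
two_people_one_job = pcexp([1, 1])  # 2 people who have each held 1 previous position
```

Both portcos get the same value, so under this definition pcexp cannot separate team size from individual experience.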
I rebuilt these variables so that they have better coverage where possible. I also set doctors, serials, serialceopreses, serialfounders, prevs, prevceopreses, prevfounders to zero when missing in PortCoSuper (I left them as null in PortCoPeopleMaster).
We shouldn't use numperson at all. It's just horrible. Instead we should try one of the following:
* serialceopreses
* serialfounders
* serials
* doctors (maybe for something different)
* prevceopreses
* prevfounders
 
But I expect that we will have problems with variation.
==== match in stage ====
WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND NOT (dealseed >= 1 OR dealearly >= 1) THEN 0::int
WHEN firmstageprefno IS NULL THEN 1::int
ELSE 0::int END AS matchinstagebroad,
=== V1 ===

==== Other ====
I added together the patents and SBIR grants pre and during VC to create the following variables (each has variation issues, but maybe try in order):
* pchaspatentsvc (1/0 indicator for portco has patents)
* pcpatentsvc (number of patents)
* pcsbircountvc (number of SBIR grants)
* pcsbiramountvc (value of SBIR grants)
 
=== Changes to date ===
 
Code is in E:\projects\unobservedcomplementarities\BuildDataset.sql
 
Changes:
*Changed MatchHighestRandom to MatchMaster. It is MatchMostNumerous (i.e., pick the firm with max(numportcos) for each portco from RLMaster) with a random tie break. It contains a lot of variables pertaining to the portco, firm, round, and match!
*MatchKeys is coname, statecode, datefirstinv, firmname, as well as minroundin, year, code, code20, code100. It replaces RealMatchesCode.
 
*Replaced SynRealSetc20 with SynthKeys_Code20.
*Replaced AllRealMatchKeysC20Code with ComboKeys_Code20, also renamed realmatch variable to isreal.
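
A minimal sketch of the MatchMostNumerous rule with a random tie break, as described for MatchMaster above. The record layout, the function name, and the seeded generator are assumptions for illustration; the actual implementation is in BuildDataset.sql.

```python
import random


def match_most_numerous(rl_rows, seed=0):
    """For each portco, pick the firm with the largest numportcos.

    rl_rows: iterable of (portco, firm, numportcos) tuples.
    Ties are broken uniformly at random using a seeded generator,
    so the same seed always reproduces the same matches.
    """
    rng = random.Random(seed)
    by_portco = {}
    for portco, firm, numportcos in rl_rows:
        by_portco.setdefault(portco, []).append((firm, numportcos))

    matches = {}
    for portco, candidates in by_portco.items():
        best = max(n for _, n in candidates)
        tied = sorted(firm for firm, n in candidates if n == best)
        matches[portco] = rng.choice(tied)
    return matches


# Toy rows (hypothetical): (portco, firm, numportcos)
rows = [
    ("P1", "FirmA", 120), ("P1", "FirmB", 45),
    ("P2", "FirmB", 45), ("P2", "FirmC", 45),
]
matches = match_most_numerous(rows)
```

Seeding the generator keeps the tie break reproducible across rebuilds, which matters if downstream tables are keyed on the chosen match.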
