Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists

Academic Paper
Title Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists
Author Ed Egan, Jeremy Fox, David Hsu, Chenyu Yang
RAs Meghana Gaur, James Chen, Kyran Adams, Marcos Ki Hyung Lee, Wei Wu
Status In development
©, 2016

Current Work

All work is in:


The dataset was built from vcdb3 (see below). The latest data files are:

  • MasterRealC20YearFullPlus.txt
  • MasterRealC20YearFullPlus - DataDictionary.txt


  • Marcos' old work is in .\marcos
  • Chenyu's code and datasets are in .\matlab
  • .\linearmodel is the current STATA work

Chenyu's box (available until Oct 31st 2019) is here:

It contains:

  • working folder
  • sample batch files

Both folders were cloned to E:\projects\unobservedcomplementarities\Chenyusbox on 17th July 2019.


  • data_import3.m uses MasterRealC20YearFullPlus.txt, which is the latest dataset

Linear Model

The objective is to add city ranking and serials, and possibly the number of coinvestors and VC experience x number of coinvestors, to a linear model. The data for the linear model should include real and synthetic matches. However, to make it comparable to Chenyu's data, we need to exclude some markets.

Marcos used Z:\VentureXpertDB\vcdb3\MasterRealOneSynth.txt as a base dataset, which contained only a single synthetic. However, in, he loads MasterWithSynthcode20.txt instead. Note that some of Marcos' do files were not in the dropbox but were in E:\mcnair\Projects\MatchingEntrepsToVC\Stata\DoFiles.

Data notes:

  • Exit, exitvalue and related measures are going to be right censored

In data_import2.m, Chenyu has the following restrictions that I clone in STATA (counts are mine):

  • He starts with MasterC20YearFull.txt, rather than MasterRealC20YearFullPlus.txt (which suggests he isn't using the latest data)
  • Mkts are pccode20,dealminroundyear
  • Removes unmatched VCs and startups (shouldn't be in latest dataset?)
  • Requires that matched VCs have synthetics with all startups in the market (should be redundant now?)
  • Requires there to be >=5 and <=10 real matches in a market
    • This reduces the number of obs from 445,710 to just 59,205 (13.3%)
  • Requires the year to be >=1990 and <=2001
    • 142,738 out of 445,710 (32%), or 18,055 out of 59,205
  • Removes duplicates (should be redundant with revised data?)
  • Removes markets with marketid NaN (not clear why this happens)
  • In master_dyad.m, Chenyu has year bounds of 2002 and 2016. This upper bound likely has right censoring on exits.
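The market-size and year restrictions above can be sketched in Python as follows. This is a minimal illustration, not Chenyu's Matlab code or the STATA do file. It assumes each observation is a dict with keys pccode20, dealminroundyear and realmatch (1 for a real match, 0 for a synthetic), and the function name filter_markets is hypothetical.

```python
from collections import Counter


def filter_markets(rows, min_real=5, max_real=10, first_year=1990, last_year=2001):
    """Keep observations whose market has min_real..max_real real matches
    and whose market year lies in [first_year, last_year].

    A market is a (pccode20, dealminroundyear) pair. Each row is a dict with
    keys 'pccode20', 'dealminroundyear' and 'realmatch' (assumed layout).
    """
    # Count real matches per market; synthetic rows (realmatch == 0) add nothing.
    real_counts = Counter()
    for row in rows:
        if row["realmatch"] == 1:
            real_counts[(row["pccode20"], row["dealminroundyear"])] += 1

    kept = []
    for row in rows:
        market = (row["pccode20"], row["dealminroundyear"])
        if not (min_real <= real_counts[market] <= max_real):
            continue  # market too small or too large
        if not (first_year <= row["dealminroundyear"] <= last_year):
            continue  # outside the year window
        kept.append(row)
    return kept
```

Both restrictions are applied together here. To reproduce the separate counts quoted in the list, each condition would need to be applied on its own.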

The STATA do file is in:


Rebuilding Marcos

Marcos starts with a dataset of reals with a single synthetic, and then constructs a dataset of reals with all synthetics (in the same year and code20).

Table 1 gives some LPMs using two sets of variables with and without VC-yearmet fixed effects. These are replicated in the new do file. In order to get something close to Marcos's reported numbers, I create a one-to-one variable so that each real match has only a single synthetic match. This gives about 60k observations as compared with Marcos's 64K (and as opposed to 445k for the full sample). The coefficients are very close to those in Table 1. There are some caveats, however. Marcos is using:

  • Amounts in billions (as am I) without taking logs (of 1+x)
  • Firmid x year (which he refers to as VC x yearmet) fixed effects, as opposed to year (i.e., dealminroundyear) x pccode20 fixed effects, which correctly define a market
  • No restrictions on timing

Table 5 gives some LPMs before and after a Lasso. The hqdist variable was first transformed so that hqdist = hqdist/1000. Note that the matchhqdist variable is bimodal. Matchbodist is also bimodal but not as strongly. The second spike in the distribution is just over 4000km, which is the arc distance from San Francisco to Boston (4335km [1])
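As a check on where that second spike sits, the great-circle distance can be computed with the haversine formula. The sketch below is illustrative only: the city coordinates and the mean Earth radius of 6371 km are assumptions rather than values from the dataset, and the original hqdist calculation may use a different formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates (assumed, not taken from the data).
SAN_FRANCISCO = (37.7749, -122.4194)
BOSTON = (42.3601, -71.0589)

distance_km = great_circle_km(*SAN_FRANCISCO, *BOSTON)
hqdist_scaled = distance_km / 1000  # same rescaling as hqdist = hqdist/1000
```

With these inputs the distance should come out a little above 4,300 km, in line with the figure quoted above.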

Again the data is just a single synthetic for each real. In this analysis, Marcos also clusters the standard errors at the year level, but does not use any fixed effects.

The labels in the pdf are somewhat misleading. The margin command reports only the underlying covariates not the interactions (unless you specifically generate the variables). An analysis of just the underlying variables without the interactions would have produced markedly different margins! The margins in table 6 column 1 of the pdf are coming from the following:

PDF -> source
hdqist -> c.hqdist##c.hqdist
sumprevsameindu20 -> c.sumprevsameindu20##c.sumprevsameindu20
serials -> c.serials##c.numprevportco
numprevportcos -> c.patentsprevc##c.numprevportco
firmtenure -> c.serials##c.firmtenure 
patentsprevc -> c.patentsprevc##c.firmtenure

Note that STATA uses ## to report both main effects for each variable as well as an interaction, so c.hqdist##c.hqdist reports both hqdist and hqdist^2, while c.serials##c.numprevportco reports serials, numprevportco, and serials*numprevportco. Variables are omitted when duplicated as in c.serials##c.numprevportco and c.patentsprevc##c.numprevportco, which both report numprevportco.
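To make the expansion rule concrete, the sketch below mimics it in Python for continuous (c.) variables only. It illustrates the bookkeeping and is not STATA's implementation. The helper names strip_prefix, expand_term and expand_spec are hypothetical.

```python
def strip_prefix(name):
    """Drop the 'c.' continuous-variable prefix if present."""
    return name[2:] if name.startswith("c.") else name


def expand_term(term):
    """Expand 'c.a##c.b' into its main effects and the interaction."""
    left, right = (strip_prefix(part) for part in term.split("##"))
    if left == right:
        # A variable interacted with itself gives the level and the square.
        return [left, f"{left}^2"]
    return [left, right, f"{left}*{right}"]


def expand_spec(terms):
    """Expand every term, keeping the first occurrence of each duplicated regressor."""
    seen = []
    for term in terms:
        for name in expand_term(term):
            if name not in seen:
                seen.append(name)
    return seen


regressors = expand_spec([
    "c.hqdist##c.hqdist",
    "c.serials##c.numprevportco",
    "c.patentsprevc##c.numprevportco",
])
# regressors == ['hqdist', 'hqdist^2', 'serials', 'numprevportco',
#                'serials*numprevportco', 'patentsprevc',
#                'patentsprevc*numprevportco']
```

Note that numprevportco appears only once in the result even though two of the terms generate it.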

We don't get the same lasso results as Marcos:

Variable          MarcosLasso NewLasso
hdqist            yes         yes
sumprevsameindu20 yes         yes
serials           yes         no
numprevportcos    yes         no
firmtenure        yes         yes
patentsprevc      no          no

But Marcos's spec isn't very grounded. He clusters standard errors at the year level but uses no fixed effects. We want to know what goes on inside markets, implying market-level fixed effects. He believes that "Since non-match specific variables are not used in the structural model, we have to interact VC or Startup specific variables." I'm not sure that this is correct. He goes on to say that "Therefore, the main specification is one which every match-specific variable has a quadratic interaction, and startup and VC variables are interacted with each other. Also, we exclude industry code from the model because it is a discrete variable, and we transform VC founding year to VC tenure, which subtracts the former with year of match."

Industry certainly won't matter with market fixed effects. Marcos also used numprevportco as if it was purely a VC variable, rather than being closer to a match specific variable.

I tried Marcos's approach using all of the possible variables (old and new) but always and only using firmtenurel as a VC interaction variable (as firmportcosl is used to pick the real from the list of potential reals, and as firmapportione~ml is correlated with firmportcosl). I will also only use pccityoverallr~1l as the PortCo interaction variable, as that's the only PortCo variable that survives to significance.

The result was:

. margins, dydx(*) post

Average marginal effects                        Number of obs     =    381,882
Model VCE    : Robust

Expression   : Linear prediction, predict()
dy/dx w.r.t. : pccityoverallrankm1l firmtenurel firmportcosl matchprevindu20l matchbodistl
               matchinstagenarrow matchcity matchstate

                     |            Delta-method
                     |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
pccityoverallrankm1l |   .0055706   .0002035    27.38   0.000     .0051718    .0059694
         firmtenurel |   .0059353   .0005165    11.49   0.000     .0049229    .0069477
        firmportcosl |   .0052155   .0004399    11.86   0.000     .0043532    .0060777
    matchprevindu20l |  -.0536725   .0007413   -72.41   0.000    -.0551254   -.0522196
        matchbodistl |  -.0106516   .0003413   -31.21   0.000    -.0113205   -.0099826
  matchinstagenarrow |   .0057086   .0007494     7.62   0.000     .0042398    .0071774
           matchcity |   .0684326   .0041129    16.64   0.000     .0603715    .0764937
          matchstate |   .0436431   .0015343    28.45   0.000      .040636    .0466503

Finally, collapse the dataset by summing realmatch and produce a histogram and some analysis.

Notes from Conference Call

On July 5th at 10am, the co-authors had a conference call. These are the notes from that call:

Ed described the data. Chenyu said that he is running the estimation as two periods: one before and one after dot com crash. He is estimating:

  • VCExperience (# prior deals in pccode20) x ranking ($amount or overall, lagged 1yr or 5yr)
  • No. of Co-investors
  • Distributional params of unobservables

The estimates are quite stable. But when he includes VC Experience x co-investors, the estimates for unobserved heterogeneity come out much larger. It is possible that this is because of the optimization process: it uses a GA, and the objective function is non-smooth because it relies upon Method-of-Moments, so the GA might not be finding the global max/min. However, it could also be a problem with the moments. This is the current sticking point. Possible solutions are:

  • Replace the variable with total number of previous startups by VC?
  • Scale Co-investors?

Ed is going to reproduce table 5 with new regression evidence for city ranking and serials. He's also going to check what is in dropbox and clear it up.

To Discuss:

  • Patents - too many zeros, Serials - not specific to a market


  • No resorting
  • Take away different categories
    • Mkt vs. Non-mkt


  • Currently running it on a 120 core cluster. Each job takes 12 cores - 20 seconds for each evaluation, 6 free params.
  • Needs Low RAM (3GB per core).
  • Rochester - 120hr max wall time
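The cluster figures above imply a rough evaluation budget, sketched below. This is back-of-the-envelope arithmetic only. It assumes that the 120-core figure and the 120-hour limit apply to the same cluster, that evaluations within a job run one after another, and that there is no queueing or start-up overhead. The function name cluster_budget is hypothetical.

```python
def cluster_budget(total_cores=120, cores_per_job=12, seconds_per_eval=20, max_hours=120):
    """Return (concurrent jobs, evaluations per job, total evaluations)."""
    concurrent_jobs = total_cores // cores_per_job
    # Evaluations one job can complete before hitting the time limit,
    # assuming strictly sequential evaluations and no overhead.
    evals_per_job = (max_hours * 3600) // seconds_per_eval
    return concurrent_jobs, evals_per_job, concurrent_jobs * evals_per_job


jobs, evals_per_job, total_evals = cluster_budget()
# jobs == 10, evals_per_job == 21600, total_evals == 216000
```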

VCDB3 Rebuild

The dataset was rebuilt using vcdb3 (see VentureXpert Data) and then fixed by Ed using RevisedDbaseCode.sql in E:\projects\vcdb3 and /bulk/vcdb3.

After that, Ed did the following:

  1. Rebuild the code so that only matched VCs are used as synthetics
  2. Add year and industry as variables
  3. Add variables:
    1. City rankings over time, lagged by 1 year and by 5 years
    2. Age
    3. Various serial measures
    4. Rebuild data dictionary!
    5. Definition of industry codes again



Fix PortCoMatchMaster and copeopleaggsimple to make and pass:

E.serialceopres, E.serialfounder, E.ceopres, E.singularceopres, E.founders, E.hasfounders, E.prevs, E.prevceopres, E.prevfounders

Also fix doctors!


Fix PortCoSuper and add:

C.serialceopres AS pcserialceopres,
C.serialfounder AS pcserialfounder, 
C.ceopres AS pcceopres, 
C.singularceopres AS pcsingularceopres, 
C.founders AS pcfounders, 
C.hasfounders AS pchasfounders, 
C.prevs AS pcprevs,
C.prevceopres AS pcprevceopres,
C.prevfounders AS pcprevfounders,
C.serialceopres*C.singularceopres AS pcserialceopressingular,
C.serialfounder*C.hasfounders AS pcserialfounderhas,
C.prevceopres*C.singularceopres AS pcprevceopressingular,
C.prevfounders*C.hasfounders AS pcprevfoundershas

Fix E:\projects\vcdb3\OriginalSQL\Ranking.sql (Note originally fixed in E:\projects\vcdb3\Ranking.sql). Specifically, add the rankingfull queries: city, state, year, dollarsrank, dealsrank, aliverank, overallrank

Finally, in:



  • MatchingVCEntrepRevisions.sql
  • MasterRealC20YearFullPlus.txt
  • MasterRealC20YearFullPlus - DataDictionary.txt

Variables for inclusion

New potential variables still being considered for inclusion:

  • VC historic CAPUM?
  • Industry-Year Measures? Needs input from Chenyu. Likely not useful.

Currently Chenyu is using:

  • Distance
  • Sub-sector specific expertise of VC - could broaden definition
    • Currently: most small 10-15 matches using pccode20
    • Might end up with more large markets
  • Startup specific experience
    • patent counts - mostly 0s: 95%.

This will be revised once he has the new data set.

Chenyu is now going to do:

  • Monte Carlo with data from empirical distro
  • Actual estimation - doesn't take long
  • Reduced form estimation: VC investment and outcomes? Logit? outcomes (exit measures). Real match explanatory variable, match characteristics, controlling
  • Target: May

Running Chenyu's code on HPCC

Two Wharton ugrads: Stacey and Kenneth (no account yet) are going to try running Chenyu's code on the HPCC. Chenyu is going to put everything into Box and invite us all to it.

Reference Papers

Jeremy's paper with David Hsu and Chenyu Yang is here: Unobserved Heterogeneity in Matching Games with an Application to Venture Capital.

Abstract: Agents in two-sided matching games vary in characteristics that are unobservable in typical data on matching markets. We investigate the identification of the distribution of unobserved characteristics using data on who matches with whom. In full generality, we consider many-to-many matching and matching with trades. The distribution of match-specific unobservables cannot be fully recovered without information on unmatched agents, but the distribution of a combination of unobservables, which we call unobserved complementarities, can be identified. Using data on unmatched agents restores identification. We estimate the contribution of observables and unobservable complementarities to match production in venture capital investments in biotechnology and medical firms.

Fox Hsu Yang (2015) - Unobserved Heterogeneity in Matching Games with an Application to Venture Capital provides some notes.

Previous Work (for reference)

Matlab Code

Abhijit Brahme (Work Log) contains his notes on working with the Matlab code. There is a separate page here: Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code.

Data specification

The data spec sent to Jeremy is in:


Data foundations

The database is vcdb2

The foundational tables were built using:


The documentation, which is a little messy, is on VC Database Rebuild

Our SQL script, which builds on top of the above database (still in vcdb2) is in:


Dataset build


Decisions we need to make:

  • Will we need synthetic matches? If so, what do we do for outcomes? Can still do dyadic and left/right pair variables.
  • Granularity of industry: To start let's use minor industry group (see below). We use a much finer grained industry definition and aggregate back up to balance out the counts somewhat later.
  • Matching to a fund or a firm: For now, we will work with funds. Deals are sometimes transferred across funds within a firm (e.g., from Kleiner fund IV to Kleiner fund V), but this is probably comparatively rare (check!).
  • Dealing with the right censorship problem: We can likely address this with indicator variables to condition on, but may want to restrict estimation to dyads that don't have this issue. For now we will take portfolio companies that received their last investment before 2007, to allow funds a full 10 years to clear their portfolios.
  • Inadequate coverage in early years: VentureXpert's coverage is notably inferior prior to 1982. We should start with portcos that received their first investment in 1985 and forward.
  • Determination of lead VC - see below
  • How to collapse VC rounds (date, amount, etc.): We will use only seed, early, later stage investment and insist on the presence of seed/early for inclusion. We can then have date first, investment duration (to date last), total investment.

Objective dataset description

Unit of observation - a startup-fund match.


  • PortCo name disclosed
  • PortCo date of first investment >= 1/1/1985
  • PortCo date of last investment <= 2007 to allow 10 yrs for the funds
  • PortCo received at least one round of Seed or Early stage investment
  • Matched VC is not undisclosed
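The inclusion criteria above can be expressed as a single predicate, sketched below. The record layout is an assumption: the field names portco_name_disclosed, first_investment, last_investment, has_seed_or_early_round and vc_undisclosed are hypothetical and do not correspond to vcdb2 column names.

```python
from datetime import date

FIRST_INVESTMENT_FLOOR = date(1985, 1, 1)
LAST_INVESTMENT_YEAR_CAP = 2007  # allows roughly 10 years for the funds


def meets_inclusion_criteria(match):
    """Return True if a startup-fund match passes all five criteria.

    `match` is a dict with hypothetical keys; dates are datetime.date objects.
    """
    return bool(
        match["portco_name_disclosed"]
        and match["first_investment"] >= FIRST_INVESTMENT_FLOOR
        and match["last_investment"].year <= LAST_INVESTMENT_YEAR_CAP
        and match["has_seed_or_early_round"]
        and not match["vc_undisclosed"]
    )
```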



  • PortCo ID
  • PortCo Name
  • Longitude, latitude,
  • State of inc., industry, year of founding, year of first investment, year of last investment
  • SEL $invested, SEL num rounds, transactional VC indicator and $inv, investment duration SEL (yrs)
  • Exit indicator, exit value, exit type indicator
  • alive2016 indicator, last round pre-2012 indicator
  • total MOOMI (Money Out Over Money In)


  • fund ID
  • fund name
  • Number of funds investing (SEL)
  • As averages (?) and for lead:
    • Fund ipo count, Fund M&A count, Fund investment count(calc at end), fund ipo rate, fund M&A rate, fund exit count, fund exit rate, fund ipo $, fund M&A $, fund exit $, fund fraction of MOOMI.
  • Total invested by lead, number of rounds participation by lead, stage of participation of lead, location of lead, last investment pre-2012 indicator, lead fund type indicator (corp, priv, gov, etc.), lead fund size, lead fund vintage year.

Dyadic variables:

  • Distance between lead and portco,
  • industry preference match between lead and portco
  • maybe stage-match (doesn't make a lot of sense when collapsing rounds) between lead and port co.

Identifying lead VCs

Possible methods:

  • Best performing participant (on exit count/value or fractional MOOMI) with tie-breaker
  • Closest participant (using great circle distance)
  • Most frequent participant with tie-breaker
  • Participant with greatest investment with tie-breaker
  • Participant in earliest round that stayed in for longest with tie-breaker
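As an illustration of one of these options, the sketch below picks the participant with the greatest investment. The tie-breaker is not specified above, so the one used here (earliest first round, then lowest fund ID) is purely an assumption, as are the field names fund_id, invested and first_round_year.

```python
def pick_lead(participants):
    """Return the fund_id of the lead VC among a portco's participants.

    Rule: greatest total investment; ties broken by earliest first round,
    then by lowest fund_id (assumed tie-breaker, for determinism only).
    """
    if not participants:
        raise ValueError("a portco needs at least one participating fund")
    best = min(
        participants,
        key=lambda p: (-p["invested"], p["first_round_year"], p["fund_id"]),
    )
    return best["fund_id"]


example = [
    {"fund_id": 101, "invested": 5.0, "first_round_year": 1999},
    {"fund_id": 102, "invested": 5.0, "first_round_year": 1997},
    {"fund_id": 103, "invested": 3.0, "first_round_year": 1996},
]
lead = pick_lead(example)  # fund 102: ties with 101 on amount, but invested earlier
```

The other methods in the list would only change the sort key, for example exit counts or distance in place of the invested amount.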

Minor Industry

Across all time and without regard to SEL vs. transaction, here's the minor industry list and counts:

        indminorgroup          | count
Industrial/Energy              |  2871
Internet Specific              |  8794
Biotechnology                  |  2592
Semiconductors/Other Elect.    |  2402
Other Products                 |  4891
Computer Hardware              |  2061
Computer Software and Services | 10550
Communications and Media       |  3271
Medical/Health                 |  4373
Consumer Related               |  3161

Literature from David

Literature to "validate" our sample. I think you probably know the papers I reference below (let me know if you need any of them-some for which I am coauthor you can get from my website).

  1. VCs are more likely to match with geographically proximate startups (Lerner on corporate governance, Sorenson on geography)
  2. Startups prefer to match with VCs with domain experience within their startup sector (Morten Sorensen), possibly also prefer to match by stage of VC specialization relative to their own stage of development (not sure which paper if any documents that)
  3. Startup patents signal VCs (Hsu/Ziedonis in SMJ)
  4. VCs prefer serial founders, or at least may interact differently with founders based on their prior founding experience (Hsu 2007 in Research Policy)
  5. If we have access to more individual data: VCs prefer to invest in founders with similar demographic characteristics relative to their own characteristics (Gompers et al within the past few years in JFE, Bengtsson and Hsu in JBV within the last few years).

SBIR and Patent Data

SBIR data was taken from McNair\Projects\SBIR\Data\Aggregate SBIR\SBIR.txt. Note: this file needed to be opened in Excel to be readable, and took a very long time to open due to its large size. The SBIR firm names were converted to a pivot table to eliminate exact repeat entries and then exported to a txt file, NSBIR. NSBIR was then matched against itself using The Matcher in mode 2 with the following arguments:

"-file1="NSBIR.txt" -file2="NSBIR.txt" -mode=2" 

Output then placed in:


The original pre-matched, cleaned NSBIR.txt file is moved to:


There is a sql file to extract VC portcos (SEL backed only), with key info from vcdb2, and distinct assignee names from allpatentsprocessed here:


There are three input files:

  • distinctNSBIR.txt - made by pivot tabling SBIR.txt from the SBIR aggregation project
  • distinctassignees.txt - extracted as distinct from allpatentsprocessed
  • vcbackedselcokeys.txt - extracted with key info from vcdb2. It needs pivot tabling to get unique names.

These .txt files were made distinct, and then matched against themselves for normalization. The normalized files still need to be matched against each other. They are located in:


These normalized files were then matched against each other, giving approximately 12,000 matches. They are located in:

McNair\Projects\MatchingEntrepsToVC\Matching\Normalized & Matched
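The normalize-then-match workflow described above can be sketched as follows. This is not The Matcher and does not reproduce its mode 2 behaviour. It is a minimal stand-in that lowercases names, strips punctuation and trailing legal suffixes, and then joins two name lists on the normalized key. The suffix list and the function names normalize_name and match_names are assumptions.

```python
import re

# Trailing legal-form tokens to ignore when comparing names (assumed list).
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_names(left_names, right_names):
    """Return (left, right) pairs of raw names whose normalized forms are equal."""
    right_index = {}
    for name in right_names:
        key = normalize_name(name)
        if key:  # skip names that normalize to nothing
            right_index.setdefault(key, []).append(name)

    matches = []
    for name in left_names:
        for other in right_index.get(normalize_name(name), []):
            matches.append((name, other))
    return matches
```

For example, match_names(["Acme Robotics, Inc."], ["ACME ROBOTICS INC", "Beta Labs LLC"]) pairs the first two names, because both normalize to "acme robotics".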