VC Startup Matching Stata Work

Jump to navigation Jump to search

VC Startup Matching Stata Work
Project logo 02.png
Project Information
Has title VC Startup Matching Stata Work
Has owner Marcos Ki Hyung Lee
Has start date 06/2018
Has deadline date
Has keywords VC, Stata, Matching, Startup
Has project status Active
Is dependent on Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists
Has sponsor McNair Center
Has project output Tool
Copyright © 2019 All Rights Reserved.

Exploratory files and dictionaries, as well as Stata Do-Files and Logs, are located in:



The VC Startup Matching Stata Work Project is support work for the Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists academic paper.

Estimate reduced form model and summary statistics to 'validate' the dataset for future structural estimation, following the literature evidence pointed out by David Hsu in the academic paper page.

To do so, I use the Startup-VC match dataset which contains variables regarding the startup, the VC and the match itself.

Stata Do-Files Guide

The directory


contains all the necessary files to run the analysis. All the raw datasets are in the directory itself, while Do-Files, log-files and raw output like Stata-to-tex tables are in their respective folders. Written reports in .tex are in the Tex folder.

Regarding Do-Files organization, the first file to be opened has to be ''. In it, I wrote the necessary globals to make referencing directories easier, while also pointing out any necessary extra packages. In the future, when the analysis is more robust and clear, the general instructions of what each do-file does will be also written in the master do-file.

For now, every do-file is more or less self-descriptive and self-contained.

Preliminary Analysis

A written report with detailed description of results can be found at


Initial Look at Dataset/ SQL code change

Before attempting to do any statistical analysis, I performed an initial look at the raw dataset to spot possible problems.

There was a mistake in the synthetic VC's count of startups from the same sector and the current match, ie, variables 'synsumprevsamesector', 'synsumprevsameindu', 'synsumprevsameindu20', 'synsumprevsameindu10', as their values contained lots of -1 and 0s. To correct it, I changed the SQL code.

More specifically, when creating table 'FirmnameInduBlowout', when doing the JOIN, the weak inequality was changed to strict inequality. Then, when creating the next table, 'FirmnameRoundInduHist', I removed the subtraction. The same was done to the corresponding synthetic tables.

The code was run again where the tables were used, and a new dataset was created. Marcos Ki Hyung Lee (Work Log)

Summary Statistics

Summary statistics were produced using the '' do-file.

Linear Probability Model


A linear probability model was suggested by Jeremy Fox, where Y=1 when the match is real, and Y=0 when the match is synthetic, and independent variables are characteristics from the VCs.

To perform this regression, it is necessary to build a new dataset. This is done on ''.

At first look, this looks like a simple case of using the -reshape- function in Stata, since the original dataset is on a 'Wide' format, ie, the synthetic VC and its characteristics for each observation (startup) are variables (columns) itself, and we want to make them into observations (rows), with a dummy indicating when it is a real or synthetic match. However, the -reshape- command does not work with string variable names.

Therefore the do-file performs a manual reshape. After sending the results to Jeremy Fox, he felt that the results were not as expected and suggested some corrections.


We want to know if VCs are more likely to match with geographically close startups, if patents are good signals for VCs, if VCs prefer serial founders and startups with similar demographic characteristics. We also want to know if startups prefer to match with VCs that have previous experience on startups of the same sector and VCs that prefer to invest in startups at their stage.

Since we don't have 'out-of-match' VCs and startups, I decided to do two different types of regressions.

I regress VCs all-time characteristics on their matched startups characteristics of interest, like distance, patents before match, demographic, etc. I am basically trying to see correlations. If 'good' VCs tend to match with very close startups, that had many patents before match, etc, then we can say there is some evidence of positive assortative matching.

On the other hand, if 'good' startups matched with VCs that were within their scope of investment, that had a history of investing in similar sectors, then these characteriscs are important for the startups.

Every regression has sector and VC founding year fixed effects.

Also, for all count variables, I've log-transformed it (adding 1 before to account for zeros) as suggested by Ed Egan. For the distance variable, I've also log-transformed it. Continuous variables are not log-transformed because most of them contains zeros, and adding 1 doesn't seem to make much sense.

Building new Dataset in SQL for a Linear Probability Model


To run the linear probability model, we need to build a new dataset. This was partially done in the Stata Do-File explained above, but doing it in SQL will give the opportunity to be more flexible when choosing the synthetic match.

The end result is a table that lists all matches that could have occurred in every possible market, including the real one.

First, we need to exactly define what a market is. In this case, a market consists of all matches that occurred in a year and within a industry sector, usually defined by a code. Therefore, the size and type of market hinges on what industry code is being used.

There are three categories, each one more granular, that defines a startup industry. The broader one is the industry class'with 3 categories, 'Information Technology', 'Medical/Health/Life Science' and 'Non-High Tech'. After that there is the Minor group, with categories such as Communications and Media, Computer Hardware, or Biotech, or Consumer Related. After that, the finer one is the Subgroup, which gets very specific, like Wireless Communication Services or Medical Imaging.

A industry code is then a 4-digit number, where the first belongs to the industry Class, the second to the industry Minor group and the last two to the Subgroup. We aggregate Subgroups with less than 20 observations (ie, number of startups) in an 'Other' category to create 'code20', and an analogous 'code100' for less than 100 observations.

We want to create a table that lists for each unique portco all the firms in its market, ie, active in the year it had its first investment from the real matched VC and that had invested in a portco of the same code100/20 in that year.

After that we can simply append/union the real match table and calculate the variables from the original dataset on this new table.

The code that does this is called 'CreatingLPM_withoutsyn.sql' when using code100, and 'CreatingLPM_withoutsyn_code20.sql'. Augi reworked and streamlined it.

At the end of the code, we also create a LPM dataset, instead of having to do a manual reshape in Stata.


The code 'Histograms.sql' exports two tables to


called 'DistribCode100.sql' and 'DistribCode20.sql'. After that, I import them into Excel and create histograms to characterize the distribution of market size. The excel file is in


Reduced Form Analysis of the Dataset

An extensive reduced form analysis was employed with a lot of back and forth feedback between Jeremy Fox and Ed Egan and me. I documented everything I did in a pdf file generated from LaTeX. Since converting it to Wiki format would be too cumbersome, including converting multiple tables and figures, I've decided it is better to host the latex files and pdf in the folder below


Everything necessary to produce the pdf file is there. Open the .tex file 'regressions.tex' and build it using your preferred latex compiler. A very easy option is to use some online compiler.

Paralelization of Matlab Code

This was done by Wei Wu. I will briefly summarize what he told me.

His main objective was to paralelize as much as possible Chenyu's code in Matlab. Apparently, this was done successfully. What changed is documented in Parallelize msmf corr coeff.m.

He also had two other projects that did not end up working. One was to use the GPU to speed up even more the code. The reasons are well documented in Matlab, CUDA, and GPU Computing.

Finally, he also tried expanding the paralelization by using NOTS (Night Owls Time-Sharing Service), a computing cluster. Since the paralelization was succesful, expanding the number of cores available was the logical next step. He ran into problems which I couldn't understand very well. Additionally, NOTS is not Windows-friendly. Check NOTS Computing for Matching Entrepreneurs to VCs for more.