Difference between revisions of "VC Startup Matching Stata Work"

From edegan.com
Jump to navigation Jump to search
Line 84: Line 84:
 
To run the linear probability model, we need to build a new dataset. This was partially done in the Stata Do-File explained above, but doing it in SQL will give the opportunity to be more flexible when choosing the synthetic match.
 
To run the linear probability model, we need to build a new dataset. This was partially done in the Stata Do-File explained above, but doing it in SQL will give the opportunity to be more flexible when choosing the synthetic match.
  
The idea is to build, for each startup, a new dataset that matches all VCs that were active in the year it matched with the real VC AND that were in the same Industry 100 code. This new set of synthetic firm for each startup will then be appended to the real match dataset.
+
The idea is to build, for each startup, a new dataset that matches all VCs that were active in the year it matched with the real VC AND that were in the same Industry 100 code. This new set of synthetic VCs (with VCs characteristics) for each startup will then be appended to the real match dataset.

Revision as of 17:14, 11 July 2018


McNair Project
VC Startup Matching Stata Work
Project logo 02.png
Project Information
Project Title VC Startup Matching Stata Work
Owner Marcos Ki Hyung Lee
Start Date 06/2018
Deadline
Keywords VC, Stata, Matching, Startup
Primary Billing
Notes
Has project status Active
Is dependent on Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists
Copyright © 2016 edegan.com. All Rights Reserved.


Exploratory files and dictionaries, as well as Stata Do-Files and Logs, are located in:

E:\McNair\Projects\MatchingEntrepsToVC\Stata

Synopsis

The VC Startup Matching Stata Work Project is support work for the Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists academic paper.

Estimate reduced form model and summary statistics to 'validate' the dataset for future structural estimation, following the literature evidence pointed out by David Hsu in the academic paper page.

To do so, I use the Startup-VC match dataset which contains variables regarding the startup, the VC and the match itself.

Stata Do-Files Guide

The directory

E:\McNair\Projects\MatchingEntrepsToVC\Stata

contains all the necessary files to run the analysis. All the raw datasets are in the directory itself, while Do-Files, log-files and raw output like Stata-to-tex tables are in their respective folders. Written reports in .tex are in the Tex folder.

Regarding Do-Files organization, the first file to be opened has to be 'master.do'. In it, I wrote the necessary globals to make referencing directories easier, while also pointing out any necessary extra packages. In the future, when the analysis is more robust and clear, the general instructions of what each do-file does will be also written in the master do-file.

For now, every do-file is more or less self-descriptive and self-contained.

Preliminary Analysis

A written report with detailed description of results can be found at

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex


Initial Look at Dataset/ SQL code change

Before attempting to do any statistical analysis, I performed an initial look at the raw dataset to spot possible problems.

There was a mistake in the synthetic VC's count of startups from the same sector and the current match, ie, variables 'synsumprevsamesector', 'synsumprevsameindu', 'synsumprevsameindu20', 'synsumprevsameindu10', as their values contained lots of -1 and 0s. To correct it, I changed the SQL code.

More specifically, when creating table 'FirmnameInduBlowout', when doing the JOIN, the weak inequality was changed to strict inequality. Then, when creating the next table, 'FirmnameRoundInduHist', I removed the subtraction. The same was done to the corresponding synthetic tables.

The code was run again where the tables were used, and a new dataset was created. Marcos Ki Hyung Lee (Work Log)

Summary Statistics

Summary statistics were produced using the 'summarystats.do' do-file.

Linear Probability Model

A linear probability model was suggested by Jeremy Fox, where Y=1 when the match is real, and Y=0 when the match is synthetic, and independent variables are characteristics from the VCs.

To perform this regression, it is necessary to build a new dataset. This is done on 'lpmsynthetic.do'.

At first look, this looks like a simple case of using the -reshape- function in Stata, since the original dataset is on a 'Wide' format, ie, the synthetic VC and its characteristics for each observation (startup) are variables (columns) itself, and we want to make them into observations (rows), with a dummy indicating when it is a real or synthetic match. However, the -reshape- command does not work with string variable names.

Therefore the do-file performs a manual reshape. After sending the results to Jeremy Fox, he felt that the results were not as expected and suggested some corrections.

Regressions

We want to know if VCs are more likely to match with geographically close startups, if patents are good signals for VCs, if VCs prefer serial founders and startups with similar demographic characteristics. We also want to know if startups prefer to match with VCs that have previous experience on startups of the same sector and VCs that prefer to invest in startups at their stage.

Since we don't have 'out-of-match' VCs and startups, I decided to do two different types of regressions.

I regress VCs all-time characteristics on their matched startups characteristics of interest, like distance, patents before match, demographic, etc. I am basically trying to see correlations. If 'good' VCs tend to match with very close startups, that had many patents before match, etc, then we can say there is some evidence of positive assortative matching.

On the other hand, if 'good' startups matched with VCs that were within their scope of investment, that had a history of investing in similar sectors, then these characteriscs are important for the startups.

Every regression has sector and VC founding year fixed effects.

Also, for all count variables, I've log-transformed it (adding 1 before to account for zeros) as suggested by Ed Egan. For the distance variable, I've also log-transformed it. Continuous variables are not log-transformed because most of them contains zeros, and adding 1 doesn't seem to make much sense.


Building new Dataset in SQL for a Linear Probability Model

To run the linear probability model, we need to build a new dataset. This was partially done in the Stata Do-File explained above, but doing it in SQL will give the opportunity to be more flexible when choosing the synthetic match.

The idea is to build, for each startup, a new dataset that matches all VCs that were active in the year it matched with the real VC AND that were in the same Industry 100 code. This new set of synthetic VCs (with VCs characteristics) for each startup will then be appended to the real match dataset.