Difference between revisions of "VC Startup Matching Stata Work"

McNair Project
VC Startup Matching Stata Work
Project Information
Project Title	VC Startup Matching Stata Work
Owner	Marcos Ki Hyung Lee
Start Date	06/2018
Deadline
Keywords	VC, Stata, Matching, Startup
Primary Billing
Notes
Has project status	Active
Is dependent on	Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists
	Copyright © 2016 edegan.com. All Rights Reserved.

Revision as of 13:48, 11 July 2018

Exploratory files and dictionaries, as well as Stata Do-Files and Logs, are located in:

E:\McNair\Projects\MatchingEntrepsToVC\Stata

Synopsis

The VC Startup Matching Stata Work Project is support work for the Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists academic paper.

The goal is to estimate reduced-form models and produce summary statistics that 'validate' the dataset for future structural estimation, following the evidence from the literature pointed out by David Hsu on the academic paper's page.

To do so, I use the Startup-VC match dataset, which contains variables describing the startup, the VC, and the match itself.

Stata Do-Files Guide

The directory

E:\McNair\Projects\MatchingEntrepsToVC\Stata

contains all the files needed to run the analysis. The raw datasets are in the directory itself, while do-files, log files, and raw output such as Stata-to-TeX tables are in their respective folders. Written reports in .tex are in the Tex folder.

Regarding the organization of the do-files, 'master.do' must be opened first. It defines the globals that make referencing directories easier and lists any extra packages required. In the future, when the analysis is more robust and clear, general instructions describing what each do-file does will also be written in the master do-file.

For now, every do-file is more or less self-explanatory and self-contained.
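A minimal sketch of what such a master do-file might look like is shown here. Only the root path and the Tex folder come from this page; the other subfolder names, the global names, and the estout package check are assumptions made for illustration, not the project's actual file.

```stata
* master.do (illustrative sketch, not the project's actual file)
clear all
set more off

* Root directory as given on this page; Stata accepts forward slashes on Windows
global root "E:/McNair/Projects/MatchingEntrepsToVC/Stata"

* Subfolder names below are assumptions, except Tex, which the page mentions
global dofiles "$root/DoFiles"
global logs    "$root/Logs"
global output  "$root/Output"
global tex     "$root/Tex"

* Flag a missing user-written package (estout is only an assumed example)
capture which esttab
if _rc != 0 {
    display as text "esttab not found; install with: ssc install estout"
}

* Individual do-files would then be run through the globals, for example:
* do "$dofiles/summarystats.do"
* do "$dofiles/lpmsynthetic.do"
```

Forward slashes are used because a backslash placed directly before a macro reference is treated by Stata as an escape character, which can silently break paths built from globals.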

Preliminary Analysis

A written report with a detailed description of the results can be found in

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex

Initial Look at Dataset

Before attempting any statistical analysis, I took an initial look at the raw dataset to spot possible problems.

The synthetic VCs' counts of startups from the same sector as the current match, i.e., the variables 'synsumprevsamesector', 'synsumprevsameindu', 'synsumprevsameindu20', and 'synsumprevsameindu10', were wrong: their values contained many -1s and 0s. To correct this, I changed the SQL code.

More specifically, in the JOIN that creates the table 'FirmnameInduBlowout', the weak inequality was changed to a strict inequality. Then, when creating the next table, 'FirmnameRoundInduHist', I removed the subtraction. The same changes were made to the corresponding synthetic tables.
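The kind of check that surfaces this problem can be sketched in Stata as follows. The variable names are the ones listed above, but the data are made-up toy values so that the block runs on its own; it is not the check actually used in the project.

```stata
* Toy data standing in for the real match dataset (values are made up)
clear
set obs 6
gen synsumprevsamesector = _n - 2    // -1, 0, 1, 2, 3, 4
gen synsumprevsameindu   = _n - 1    // 0, 1, ..., 5
gen synsumprevsameindu20 = _n
gen synsumprevsameindu10 = _n

* A count of previous startups should never be negative
foreach v of varlist synsumprevsamesector synsumprevsameindu ///
        synsumprevsameindu20 synsumprevsameindu10 {
    quietly count if `v' < 0
    display as text "`v': " as result r(N) as text " negative value(s)"
}
```

On the real dataset, the same loop could be rerun after the SQL change to confirm that no negative counts remain.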

Summary Statistics

Summary statistics were produced using the 'summarystats.do' do-file.

Linear Probability Model

Jeremy Fox suggested a linear probability model in which Y=1 when the match is real and Y=0 when the match is synthetic, with VC characteristics as the independent variables.

To run this regression, it is necessary to build a new dataset. This is done in 'lpmsynthetic.do'.

At first glance, this looks like a simple case for Stata's -reshape- command. The original dataset is in 'wide' format, i.e., the synthetic VC and its characteristics for each observation (startup) are themselves variables (columns), and we want to turn them into observations (rows), with a dummy indicating whether the match is real or synthetic. However, the -reshape- command did not work with the string-based variable names in this dataset.

Therefore, the do-file performs a manual reshape. After I sent the results to Jeremy Fox, he felt that they were not as expected and suggested some corrections.
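One way a manual reshape of this kind can be done is to split the real and synthetic columns into two datasets, give them common variable names, and stack them with -append-. The sketch below illustrates that idea on toy data. It is not the content of 'lpmsynthetic.do': 'startupid', 'sumprevsamesector', and 'realmatch' are hypothetical names, and only 'synsumprevsamesector' appears on this page.

```stata
* Toy wide dataset: one row per startup, with the real VC's and the
* synthetic VC's characteristic held in separate columns
clear
set seed 12345
set obs 200
gen long startupid       = _n
gen sumprevsamesector    = rpoisson(3)   // real VC (hypothetical name)
gen synsumprevsamesector = rpoisson(2)   // synthetic VC

* Build the synthetic rows and store them in a temporary file
preserve
    keep startupid synsumprevsamesector
    rename synsumprevsamesector sumprevsamesector
    gen byte realmatch = 0
    tempfile synthetic
    save `synthetic'
restore

* Build the real rows and stack the synthetic rows underneath
keep startupid sumprevsamesector
gen byte realmatch = 1
append using `synthetic'

* Linear probability model: Y = 1 for real matches, 0 for synthetic ones
regress realmatch sumprevsamesector, vce(robust)
```

Robust standard errors are used here because the error term of a linear probability model is heteroskedastic by construction.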

Regressions

We want to know whether VCs are more likely to match with geographically close startups, whether patents are good signals for VCs, and whether VCs prefer serial founders and startups with similar demographic characteristics. We also want to know whether startups prefer to match with VCs that have previous experience with startups in the same sector, and with VCs that prefer to invest in startups at their stage.

Since we don't have 'out-of-match' VCs and startups, I decided to run two different types of regressions.

I regress VCs' all-time characteristics on the characteristics of interest of their matched startups, such as distance, patents before the match, demographics, etc. I am essentially looking for correlations. If 'good' VCs tend to match with very close startups that had many patents before the match, etc., then we can say there is some evidence of positive assortative matching.

On the other hand, if 'good' startups matched with VCs whose investment scope covered them and that had a history of investing in similar sectors, then these characteristics are important for the startups.

Every regression has sector and VC founding year fixed effects.

Also, I've log-transformed all count variables (adding 1 first to account for zeros), as suggested by Ed Egan. I've log-transformed the distance variable as well. The other continuous variables are not log-transformed because most of them contain zeros, and adding 1 to them doesn't seem to make much sense.
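The transformations and fixed effects described above can be sketched as follows. All variable names and the toy data are assumptions made for illustration; the actual specifications are in the project's do-files.

```stata
* Toy data: one row per real match (all names and values are hypothetical)
clear
set seed 2018
set obs 500
gen sector         = ceil(5 * runiform())          // 5 sectors
gen vcfoundingyear = 1990 + floor(20 * runiform())
gen patentsbefore  = rpoisson(2)                   // count, can be zero
gen vcnumdeals     = rpoisson(20)                  // count, can be zero
gen distance       = 1 + 3000 * runiform()         // strictly positive

* Count variables: log(1 + x) so that zeros stay defined
gen lpatents = ln(1 + patentsbefore)
gen lvcdeals = ln(1 + vcnumdeals)

* Distance: plain log
gen ldistance = ln(distance)

* VC characteristic on matched-startup characteristics,
* with sector and VC founding year fixed effects
regress lvcdeals ldistance lpatents i.sector i.vcfoundingyear, vce(robust)
```

The i. prefix tells Stata to include a full set of indicator variables for each category, which is how the sector and founding year fixed effects enter this sketch.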
