edegan.com - User contributions [en]

VC Startup Matching Stata Work

2018-08-22T20:08:22Z

Marcoslee: /* Paralelization of Matlab Code */

{{McNair Projects
|Has title=VC Startup Matching Stata Work
|Has owner=Marcos Ki Hyung Lee,
|Has start date=06/2018
|Has keywords=VC, Stata, Matching, Startup
|Has project status=Active
|Is dependent on=Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists,
}}

Exploratory files and dictionaries, as well as Stata Do-Files and Logs, are located in:

E:\McNair\Projects\MatchingEntrepsToVC\Stata

==Synopsis==

The VC Startup Matching Stata Work Project is support work for the [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]] academic paper.

The goal is to estimate reduced-form models and produce summary statistics to 'validate' the dataset for future structural estimation, following the evidence from the literature pointed out by David Hsu on the academic paper page.

To do so, I use the Startup-VC match dataset which contains variables regarding the startup, the VC and the match itself.

==Stata Do-Files Guide==

The directory

E:\McNair\Projects\MatchingEntrepsToVC\Stata

contains all the necessary files to run the analysis. All the raw datasets are in the directory itself, while Do-Files, log-files and raw output like Stata-to-tex tables are in their respective folders. Written reports in .tex are in the Tex folder.

The first do-file to open is 'master.do'. It defines the globals that make referencing directories easier and notes any extra packages that are required. Once the analysis is more settled, general instructions on what each do-file does will also be added to the master do-file.

For now, every do-file is more or less self-descriptive and self-contained.

==Preliminary Analysis==

A written report with a detailed description of the results can be found at

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex

===Initial Look at Dataset/ SQL code change===

Before attempting to do any statistical analysis, I performed an initial look at the raw dataset to spot possible problems.

There was a mistake in the synthetic VCs' counts of previous startups from the same sector as the current match, i.e., the variables 'synsumprevsamesector', 'synsumprevsameindu', 'synsumprevsameindu20', 'synsumprevsameindu10': their values contained many -1s and 0s. To correct it, I changed the SQL code.

More specifically, in the JOIN that creates the table 'FirmnameInduBlowout', the weak inequality was changed to a strict inequality. Then, when creating the next table, 'FirmnameRoundInduHist', I removed the subtraction. The same changes were made to the corresponding synthetic tables.
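The actual SQL is not reproduced on this page, so the following is only a sketch of one way such a bug can arise and how the change described above would fix it. It assumes the old code counted prior same-sector investments with a weak inequality on the date and then subtracted 1 to drop the current match, which is only right when the current match is in the firm's own history. All table, column, and firm names are hypothetical.

```python
import sqlite3

# Hypothetical miniature of the investment history; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE investments (firm TEXT, portco TEXT, sector TEXT, year INTEGER);
INSERT INTO investments VALUES
  ('RealVC', 'A', 'IT', 2001),
  ('RealVC', 'B', 'IT', 2003),   -- the current (real) match
  ('SynVC',  'C', 'IT', 2002);

-- One real and two synthetic candidate firms for the same deal (portco B, 2003).
CREATE TABLE matches (firm TEXT, portco TEXT, sector TEXT, year INTEGER);
INSERT INTO matches VALUES
  ('RealVC', 'B', 'IT', 2003),
  ('SynVC',  'B', 'IT', 2003),
  ('NewVC',  'B', 'IT', 2003);
""")

def prior_counts(op, adjust):
    """Count same-sector investments per candidate firm.

    op     -- comparison used in the JOIN ('<=' is weak, '<' is strict)
    adjust -- constant subtracted from the count afterwards
    """
    sql = f"""
        SELECT m.firm, COUNT(i.portco) - {adjust} AS n
        FROM matches AS m
        LEFT JOIN investments AS i
          ON i.firm = m.firm AND i.sector = m.sector AND i.year {op} m.year
        GROUP BY m.firm
    """
    return dict(conn.execute(sql).fetchall())

old = prior_counts("<=", 1)  # weak inequality, then subtract the current match
new = prior_counts("<", 0)   # strict inequality, no subtraction

print(old)
print(new)
```

Under these assumptions the old logic is only correct for the real VC: a synthetic VC never made the current investment, so subtracting 1 undercounts its history and yields -1 when it has none. The strict inequality excludes the current match directly and needs no correction.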

The code was run again where the tables were used, and a new dataset was created. [[Marcos Ki Hyung Lee (Work Log)]]

===Summary Statistics===

Summary statistics were produced using the 'summarystats.do' do-file.

===Linear Probability Model===

THIS IS OBSOLETE AS OF NEW SQL CODE

A linear probability model was suggested by Jeremy Fox, where Y=1 when the match is real, and Y=0 when the match is synthetic, and independent variables are characteristics from the VCs.

To perform this regression, it is necessary to build a new dataset. This is done in 'lpmsynthetic.do'.

At first glance, this looks like a simple case for Stata's -reshape- command: the original dataset is in 'wide' format, i.e., the synthetic VC and its characteristics are stored as variables (columns) within each observation (startup), and we want to turn them into observations (rows), with a dummy indicating whether the match is real or synthetic. However, the -reshape- command does not work with string variable names.

Therefore, the do-file performs a manual reshape. After I sent the results to Jeremy Fox, he felt that they were not as expected and suggested some corrections.
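The wide-to-long conversion described above can be sketched as follows. This is not the logic of 'lpmsynthetic.do' itself, only an illustration in Python; the column names ('realvc', 'synvc', 'realvc_deals', 'synvc_deals') and the sample values are hypothetical.

```python
# Wide format: one row per startup, with the real and synthetic VC as columns.
wide = [
    {"portco": "A", "realvc": "Alpha", "realvc_deals": 12,
     "synvc": "Beta", "synvc_deals": 3},
    {"portco": "B", "realvc": "Gamma", "realvc_deals": 7,
     "synvc": "Alpha", "synvc_deals": 12},
]

def to_long(rows):
    """Turn each wide row into two long rows: one real, one synthetic."""
    long_rows = []
    for row in rows:
        for prefix, is_real in (("realvc", 1), ("synvc", 0)):
            long_rows.append({
                "portco": row["portco"],
                "vc": row[prefix],
                "deals": row[f"{prefix}_deals"],
                "real": is_real,  # dependent variable Y of the LPM
            })
    return long_rows

long_rows = to_long(wide)
for r in long_rows:
    print(r)
```

Each startup contributes one row with real = 1 and one row per synthetic VC with real = 0, which is the shape the linear probability model needs.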

===Regressions===

We want to know if VCs are more likely to match with geographically close startups, if patents are good signals for VCs, and if VCs prefer serial founders and startups with similar demographic characteristics. We also want to know if startups prefer to match with VCs that have previous experience with startups in the same sector and with VCs that prefer to invest in startups at their stage.

Since we don't have 'out-of-match' VCs and startups, I decided to do two different types of regressions.

I regress VCs' all-time characteristics on the characteristics of interest of their matched startups, such as distance, patents before the match, demographics, etc. I am basically looking for correlations. If 'good' VCs tend to match with startups that are very close, that had many patents before the match, etc., then we can say there is some evidence of positive assortative matching.

On the other hand, if 'good' startups matched with VCs that were within their scope of investment and that had a history of investing in similar sectors, then these characteristics are important for the startups.

Every regression has sector and VC founding year fixed effects.

Also, I log-transformed all count variables (adding 1 first to account for zeros), as suggested by Ed Egan. I also log-transformed the distance variable. Continuous variables are not log-transformed because most of them contain zeros, and adding 1 doesn't seem to make much sense.
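As a small illustration of the count transformation, here is a minimal Python sketch. The function names are hypothetical, and the treatment of distance assumes strictly positive values, which the page does not state.

```python
import math

def log_count(n):
    """log(1 + n): defined at n = 0, where it equals 0."""
    return math.log1p(n)

def log_distance(d):
    """Plain log of a distance; assumes d > 0 (an assumption, not from the page)."""
    if d <= 0:
        raise ValueError("distance must be strictly positive")
    return math.log(d)

patent_counts = [0, 1, 9]
logged = [log_count(n) for n in patent_counts]
print(logged)
```

Adding 1 keeps zero counts in the sample instead of dropping them as missing values.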

==Building new Dataset in SQL for a Linear Probability Model==

E:\McNair\Projects\MatchingEntrepsToVC\Stata\SQL\

To run the linear probability model, we need to build a new dataset. This was partially done in the Stata do-file explained above, but doing it in SQL gives more flexibility in choosing the synthetic matches.

The end result is a table that lists all matches that could have occurred in every possible market, including the real one.

First, we need to define exactly what a market is. In this case, a market consists of all matches that occurred in a given year and within an industry sector, usually defined by a code. Therefore, the size and type of each market hinge on which industry code is used.

Three categories, each more granular than the last, define a startup's industry. The broadest is the industry Class, with 3 categories: 'Information Technology', 'Medical/Health/Life Science' and 'Non-High Tech'. Next is the Minor group, with categories such as Communications and Media, Computer Hardware, Biotech, or Consumer Related. The finest is the Subgroup, which gets very specific, like Wireless Communication Services or Medical Imaging.

An industry code is then a 4-digit number, where the first digit identifies the industry Class, the second the industry Minor group, and the last two the Subgroup. We aggregate Subgroups with fewer than 20 observations (i.e., number of startups) into an 'Other' category to create 'code20', and analogously create 'code100' for Subgroups with fewer than 100 observations.
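The code structure and the aggregation rule can be sketched as follows. The sample codes are made up, and the sketch pools all small Subgroups into a single 'Other' value; whether the project pools them overall or within each Minor group is not stated here, so treat that as an assumption.

```python
from collections import Counter

def split_code(code):
    """Split a 4-digit industry code into Class, Minor group, and Subgroup."""
    return {"class": code[0], "minor": code[1], "subgroup": code[2:]}

def aggregate_codes(codes, threshold):
    """Replace codes observed fewer than `threshold` times with 'Other'.

    threshold=20 would correspond to 'code20', threshold=100 to 'code100'.
    """
    counts = Counter(codes)
    return [c if counts[c] >= threshold else "Other" for c in codes]

# Hypothetical startup-level codes; a tiny threshold keeps the example readable.
startup_codes = ["1210", "1210", "1210", "2105", "3301"]
code3 = aggregate_codes(startup_codes, threshold=3)

print(split_code("1210"))
print(code3)
```

A higher threshold produces fewer, larger markets, which is why 'code100' is coarser than 'code20'.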

We want to create a table that lists, for each unique portco, all the firms in its market, i.e., firms that were active in the year the portco had its first investment from the real matched VC and that invested in a portco of the same code100/code20 in that year.

After that, we can simply append (UNION) the real match table and calculate the variables from the original dataset on this new table.

The code that does this is called 'CreatingLPM_withoutsyn.sql' when using code100, and 'CreatingLPM_withoutsyn_code20.sql' when using code20. Augi reworked and streamlined it.
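The construction above can be sketched in Python. This is not the contents of 'CreatingLPM_withoutsyn.sql'; it is a simplified illustration in which a firm counts as being in a market if it made an investment in that code in that year, and the real and counterfactual pairs are produced in one pass rather than by a separate UNION. All names and data are hypothetical.

```python
# Each tuple is a real first investment: (portco, firm, code, year).
deals = [
    ("A", "Alpha", "1210", 2005),
    ("B", "Beta",  "1210", 2005),
    ("C", "Gamma", "2105", 2005),
    ("D", "Beta",  "1210", 2006),
]

def build_lpm_rows(deals):
    """Pair every portco with every firm in its (code, year) market."""
    # Firms that invested in each market.
    firms_in_market = {}
    for _portco, firm, code, year in deals:
        firms_in_market.setdefault((code, year), set()).add(firm)

    rows = []
    for portco, real_firm, code, year in deals:
        for firm in sorted(firms_in_market[(code, year)]):
            rows.append({
                "portco": portco,
                "firm": firm,
                "code": code,
                "year": year,
                "real": int(firm == real_firm),  # 1 = real match, 0 = counterfactual
            })
    return rows

lpm_rows = build_lpm_rows(deals)
for r in lpm_rows:
    print(r)
```

Markets with a single firm contribute only the real match, so they add no variation in the dependent variable.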

---------------

At the end of the code, we also create the LPM dataset directly, instead of having to do a manual reshape in Stata.

===Histograms===

The code 'Histograms.sql' exports two tables to

Z:\VentureCapitalData\SDCVCData\vcdb2

called 'DistribCode100.sql' and 'DistribCode20.sql'. I then import them into Excel and create histograms to characterize the distribution of market size. The Excel file is in

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex
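For reference, the distribution of market size can also be tabulated directly. The sketch below assumes market size means the number of matches in a (code, year) cell, which follows the market definition given earlier but is not confirmed as what 'Histograms.sql' computes. The data are hypothetical.

```python
from collections import Counter

# Each tuple is a match: (portco, firm, code, year).
sample_deals = [
    ("A", "Alpha", "1210", 2005),
    ("B", "Beta",  "1210", 2005),
    ("C", "Gamma", "2105", 2005),
    ("D", "Beta",  "1210", 2006),
]

def market_size_distribution(deals):
    """Map each market size to the number of markets of that size."""
    matches_per_market = Counter((code, year) for _p, _f, code, year in deals)
    return Counter(matches_per_market.values())

distribution = market_size_distribution(sample_deals)
for size in sorted(distribution):
    print(size, distribution[size])
```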

==Reduced Form Analysis of the Dataset==

An extensive reduced-form analysis was carried out, with a lot of back-and-forth feedback between Jeremy Fox, Ed Egan, and me. I documented everything I did in a PDF file generated from LaTeX. Since converting it to wiki format, including its multiple tables and figures, would be too cumbersome, I decided to host the LaTeX files and the PDF in the folder below:

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Pdf

Everything necessary to produce the PDF file is there. Open the file 'regressions.tex' and build it using your preferred LaTeX compiler. An easy option is to use an online compiler.

==Parallelization of Matlab Code==

This was done by [[Wei Wu]]. I will briefly summarize what he told me.

His main objective was to parallelize Chenyu's Matlab code as much as possible. Apparently, this was done successfully. The changes are documented in [[Parallelize msmf corr coeff.m]].

He also had two other projects that did not end up working. One was to use the GPU to speed up the code even further. The reasons this did not work are documented in [[Matlab, CUDA, and GPU Computing]].

Finally, he also tried to extend the parallelization by using NOTS (Night Owls Time-Sharing Service), a computing cluster. Since the parallelization was successful, expanding the number of available cores was the logical next step. He ran into problems that I did not fully understand. Additionally, NOTS is not Windows-friendly. See [[NOTS Computing for Matching Entrepreneurs to VCs]] for more.

VC Startup Matching Stata Work

2018-08-22T20:01:10Z

Marcoslee:

VC Startup Matching Stata Work

2018-08-22T19:51:16Z

Marcoslee: /* Building new Dataset in SQL for a Linear Probability Model */

VC Startup Matching Stata Work

2018-08-22T19:47:12Z

Marcoslee: /* Linear Probability Model */


VC Startup Matching Stata Work

2018-08-22T19:46:59Z

Marcoslee: /* Linear Probability Model */


Marcos Ki Hyung Lee

2018-08-03T19:20:33Z

Marcoslee:

{{McNair Staff
|position=Research Team
|name=Marcos Ki Hyung Lee
|user_image=marcosleepic.jpg
|degree=Ph.D.
|major=Economics
|class=2022
|join_date=07/2018
|skills=Econometrics,
|interests=Underground urban sub-cultures, Hiking, Jazz, Live Music, Math, Postmodernism in literature, Movies,
|email=marcos.lee@rice.edu
|status=Active
}}
==Summer 2018==
[[Marcos Ki Hyung Lee (Work Log)]]

[[VC Startup Matching Stata Work]]

VC Startup Matching Stata Work

2018-07-30T21:38:37Z

Marcoslee: /* Building new Dataset in SQL for a Linear Probability Model */

{{McNair Projects
|Has title=VC Startup Matching Stata Work
|Has owner=Marcos Ki Hyung Lee,
|Has start date=06/2018
|Has keywords=VC, Stata, Matching, Startup
|Has project status=Active
|Is dependent on=Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists,
}}

Exploratory files and dictionaries, as well as Stata Do-Files and Logs, are located in:

E:\McNair\Projects\MatchingEntrepsToVC\Stata

==Synopsis==

The VC Startup Matching Stata Work Project is support work for the [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]] academic paper.

Estimate reduced form model and summary statistics to 'validate' the dataset for future structural estimation, following the literature evidence pointed out by David Hsu in the academic paper page.

To do so, I use the Startup-VC match dataset which contains variables regarding the startup, the VC and the match itself.

==Stata Do-Files Guide==

The directory

E:\McNair\Projects\MatchingEntrepsToVC\Stata

contains all the necessary files to run the analysis. All the raw datasets are in the directory itself, while Do-Files, log-files and raw output like Stata-to-tex tables are in their respective folders. Written reports in .tex are in the Tex folder.

Regarding Do-Files organization, the first file to be opened has to be 'master.do'. In it, I wrote the necessary globals to make referencing directories easier, while also pointing out any necessary extra packages. In the future, when the analysis is more robust and clear, the general instructions of what each do-file does will be also written in the master do-file.

For now, every do-file is more or less self-descriptive and self-contained.

==Preliminary Analysis==

A written report with detailed description of results can be found at

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex

===Initial Look at Dataset/ SQL code change===

Before attempting to do any statistical analysis, I performed an initial look at the raw dataset to spot possible problems.

There was a mistake in the synthetic VC's count of startups from the same sector and the current match, ie, variables 'synsumprevsamesector', 'synsumprevsameindu', 'synsumprevsameindu20', 'synsumprevsameindu10', as their values contained lots of -1 and 0s. To correct it, I changed the SQL code.

More specifically, when creating table 'FirmnameInduBlowout', when doing the JOIN, the weak inequality was changed to strict inequality. Then, when creating the next table, 'FirmnameRoundInduHist', I removed the subtraction. The same was done to the corresponding synthetic tables.

The code was run again where the tables were used, and a new dataset was created. [[Marcos Ki Hyung Lee (Work Log)]]

===Summary Statistics===

Summary statistics were produced using the 'summarystats.do' do-file.

===Linear Probability Model===

A linear probability model was suggested by Jeremy Fox, where Y=1 when the match is real, and Y=0 when the match is synthetic, and independent variables are characteristics from the VCs.

To perform this regression, it is necessary to build a new dataset. This is done on 'lpmsynthetic.do'.

At first look, this looks like a simple case of using the -reshape- function in Stata, since the original dataset is on a 'Wide' format, ie, the synthetic VC and its characteristics for each observation (startup) are variables (columns) itself, and we want to make them into observations (rows), with a dummy indicating when it is a real or synthetic match. However, the -reshape- command does not work with string variable names.

Therefore the do-file performs a manual reshape. After sending the results to Jeremy Fox, he felt that the results were not as expected and suggested some corrections.

===Regressions===

We want to know if VCs are more likely to match with geographically close startups, if patents are good signals for VCs, if VCs prefer
serial founders and startups with similar demographic characteristics. We also want to know if startups prefer to match with VCs that have previous experience on startups of the same sector and VCs that prefer to invest in startups at their stage.

Since we don't have 'out-of-match' VCs and startups, I decided to do two different types of regressions.

I regress VCs all-time characteristics on their matched startups characteristics of interest, like distance, patents before match, demographic, etc. I am basically trying to see correlations. If 'good' VCs tend to match with very close startups, that had many patents before match, etc, then we can say there is some evidence of positive assortative matching.

On the other hand, if 'good' startups matched with VCs that were within their scope of investment, that had a history of investing in
similar sectors, then these characteriscs are important for the startups.

Every regression has sector and VC founding year fixed effects.

Also, for all count variables, I've log-transformed it (adding 1 before to account for zeros) as suggested by Ed Egan. For the distance variable, I've also log-transformed it. Continuous variables are not log-transformed because most of them contains zeros, and adding 1 doesn't seem to make much sense.

==Building new Dataset in SQL for a Linear Probability Model==

E:\McNair\Projects\MatchingEntrepsToVC\Stata\SQL\

To run the linear probability model, we need to build a new dataset. This was partially done in the Stata Do-File explained above, but doing it in SQL will give the opportunity to be more flexible when choosing the synthetic match.

The end result is a table that lists all matches that could have occurred in every possible market, including the real one.

First, we need to exactly define what a market is. In this case, a market consists of all matches that occurred in a year and within a industry sector, usually defined by a code. Therefore, the size and type of market hinges on what industry code is being used.

There are three categories, each one more granular, that defines a startup industry. The broader one is the industry class'with 3 categories, 'Information Technology', 'Medical/Health/Life Science' and 'Non-High Tech'. After that there is the Minor group, with categories such as Communications and Media, Computer Hardware, or Biotech, or Consumer Related. After that, the finer one is the Subgroup, which gets very specific, like Wireless Communication Services or Medical Imaging.

An industry code is then a 4-digit number, where the first digit gives the industry Class, the second the Minor group, and the last two the Subgroup. We aggregate Subgroups with fewer than 20 observations (i.e., number of startups) into an 'Other' category to create 'code20', and build an analogous 'code100' using a threshold of 100 observations.
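A minimal sketch of the code structure and the aggregation, written in Python rather than the project's SQL. The function names are hypothetical, and pooling every small Subgroup into a single 'Other' label is an assumption; the actual SQL may pool within a Class or Minor group instead.

```python
from collections import Counter

def split_code(code):
    """Split a 4-digit industry code into (class, minor group, subgroup)."""
    s = str(code)
    if len(s) != 4 or not s.isdigit():
        raise ValueError(f"expected a 4-digit code, got {code!r}")
    return s[0], s[1], s[2:]

def aggregate_codes(startup_codes, threshold, other="Other"):
    """Map each code to itself if at least `threshold` startups carry it.

    Codes with fewer than `threshold` startups are mapped to `other`
    (assumption: one pooled label for all small Subgroups).
    """
    counts = Counter(startup_codes)
    return {c: (c if n >= threshold else other) for c, n in counts.items()}
```

Calling `aggregate_codes(codes, 20)` would give a mapping in the spirit of 'code20', and `aggregate_codes(codes, 100)` one in the spirit of 'code100'.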

We want to create a table that lists, for each unique portco, all the VC firms in its market, i.e., firms that were active in the year the portco received its first investment from its real matched VC and that invested in a portco with the same code100/code20 in that year.

After that we can simply append/union the real match table and calculate the variables from the original dataset on this new table.

The code that does this is called 'CreatingLPM_withoutsyn.sql' when using code100, and 'CreatingLPM_withoutsyn_code20.sql' when using code20.
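The construction described in this section can be sketched in Python as follows. This is not the content of the SQL files; the function name and the tuple layout are hypothetical, and the sketch flags the real pair in a single pass instead of appending the real match table afterwards.

```python
from collections import defaultdict

def build_lpm_rows(real_matches):
    """Build the candidate-match table for a linear probability model.

    real_matches: iterable of (portco, vc, year, code) tuples, one per real
    first investment. Returns sorted rows (portco, vc, year, code, matched),
    where matched is 1 for a real pair and 0 for every other VC that invested
    in the same year-code market.
    """
    real_matches = list(real_matches)

    # All VCs that made a real investment in each (year, code) market
    vcs_in_market = defaultdict(set)
    for portco, vc, year, code in real_matches:
        vcs_in_market[(year, code)].add(vc)

    real_pairs = {(portco, vc) for portco, vc, _, _ in real_matches}

    rows = set()
    for portco, _, year, code in real_matches:
        for vc in vcs_in_market[(year, code)]:
            rows.add((portco, vc, year, code, int((portco, vc) in real_pairs)))
    return sorted(rows)
```

With a coarser code (code100) each market contains more VCs, so each portco gets more candidate rows than under code20.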

===Histograms===

The code 'Histograms.sql' exports two tables to

Z:\VentureCapitalData\SDCVCData\vcdb2

called 'DistribCode100.sql' and 'DistribCode20.sql'. After that, I import them into Excel and create histograms to characterize the distribution of market size. The Excel file is in

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex
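A minimal Python sketch of the tabulation behind such histograms, assuming market size means the number of real matches in a year-code market. The function name is hypothetical and this is not the content of 'Histograms.sql'.

```python
from collections import Counter

def market_size_distribution(matches):
    """Tabulate how many markets have each size.

    matches: iterable of (year, code) pairs, one per real match.
    Returns {market size: number of markets with that size}.
    """
    sizes = Counter(matches)               # matches per (year, code) market
    return dict(Counter(sizes.values()))   # markets per size
```

Running this once with code100 and once with code20 gives the two distributions to compare.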

Marcos Ki Hyung Lee (Work Log)

2018-07-30T21:16:26Z

Marcoslee: /* By Date */

==Summer 2018==

==Notes from Ed==

Please build and link to a project page for the STATA analysis!

Also, if/when you make changes to a sql file, please:
#Run them through or make it clear that you haven't with comments
#Let me know by posting it on a project page and linking to in your work log.
Otherwise, we are both going to be making conflicting changes to the same files.

==By Date==

'''Project Page''': [[VC Startup Matching Stata Work]]

'''2018-07-23 until 07-27:'''

This week was dedicated to refining the Linear Probability Model and the reduced form evidence that I did for Jeremy.

As requested by Jeremy, I added extensive notes on the interpretation of the coefficients of each estimated model. I also fixed small technical issues in each model.

Additionally, after a call with Ed, I checked the distribution of market size when using either year-code100 or year-code20 as market definition. See Project Page for more on this.

'''2018-07-11 until 07-20:'''

Basically spent this entire week working out the code to build the LPM dataset. Detailed description of code and dataset on project page.

'''2018-07-13:'''

Created a new SQL code to build the LPM dataset. Sent it to Ed Egan to check.

'''2018-07-12:'''

Sick day.

'''2018-07-11:'''

Skype meeting with Ed Egan to discuss new dataset. Need to build a broader dataset for running the LPM model. Started studying the SQL code and thinking about the necessary changes to get the desired dataset.

'''2018-07-10:'''

Investigated why the LPM is not giving the expected results.

'''2018-07-06:'''

Received new suggestions from Fox.

Started rewriting the report following comments from Fox.

'''2018-07-05:'''

Wrote a report with all the results so far by request from Jeremy Fox and sent it to him.

Also added an LPM to the analysis, at Jeremy Fox's suggestion, although I suspect there is something wrong with the way I built the dataset needed for it. Emailed my worries to Fox.

After meeting with Ed Egan, changed the SQL code that builds the VCs' history variables. Instead of subtracting 1 from the sum of all portcos that worked with the VC, we no longer subtract, and instead of a weak inequality in the LEFT JOIN, we use a strict inequality.
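The difference between the two counting rules can be illustrated with a small Python sketch. This is not the project's SQL; the function names, the tuple layout and the sample data are hypothetical.

```python
from datetime import date

def prev_portcos_strict(investments, vc, match_date):
    """Count the VC's investments dated strictly before match_date."""
    return sum(1 for v, _, d in investments if v == vc and d < match_date)

def prev_portcos_weak_minus_one(investments, vc, match_date):
    """Earlier rule: count investments on or before match_date, then subtract 1."""
    return sum(1 for v, _, d in investments if v == vc and d <= match_date) - 1

# Hypothetical history: V1 invests in P1 first, then in P2 and P3 on the same day
investments = [
    ("V1", "P1", date(2000, 1, 1)),
    ("V1", "P2", date(2001, 1, 1)),
    ("V1", "P3", date(2001, 1, 1)),
]
```

For the match with P2, the strict rule counts only P1, while the earlier rule also counts the same-day investment in P3. For a VC with no investments on or before the match date, as can happen with a synthetic pair, the earlier rule returns -1 and the strict rule returns 0.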

'''2018-07-04:'''

Rework data analysis following suggestions from Egan and Fox.

'''2018-07-03:'''

Worked with regressions and made log files with summary statistics and outputs.

Had Skype meetings with Ed Egan and Jeremy Fox.

'''2018-06-26:'''

Picked relevant variables and started thinking of some regression specifications.

'''2018-06-25:'''

Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the non-synthetic ones: there is double counting when a VC met with more than one portco on the same day AND the portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (prior to meeting the portco) are mostly 0 or -1, while the all-time ones have many missing values. Initially I thought it was an error in the code, but after thinking about this, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with real matches.

'''2018-06-22:'''

Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firms' variables seem to be wrong, as there are negative numbers and many missing values.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos the VC invested in that were in the same industry code as the current matched PortCo, before meeting it (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, they are mostly -1 for the sum variables and missing for the tot variables.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be any inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-07-24T20:41:47Z

Marcoslee: /* By Date */


VC Startup Matching Stata Work

2018-07-20T21:38:38Z

Marcoslee: /* Building new Dataset in SQL for a Linear Probability Model */

Marcos Ki Hyung Lee (Work Log)

2018-07-19T22:51:01Z

Marcoslee: /* By Date */


Marcos Ki Hyung Lee (Work Log)

2018-07-13T20:39:16Z

Marcoslee:


VC Startup Matching Stata Work

2018-07-11T21:14:51Z

Marcoslee: /* Building new Dataset in SQL for a Linear Probability Model */

VC Startup Matching Stata Work

2018-07-11T21:13:51Z

Marcoslee:

VC Startup Matching Stata Work

2018-07-11T18:06:21Z

Marcoslee: /* Preliminary Analysis */

VC Startup Matching Stata Work

2018-07-11T17:48:35Z

Marcoslee:

VC Startup Matching Stata Work

2018-07-11T16:44:19Z

Marcoslee:

Marcos Ki Hyung Lee (Work Log)

2018-07-11T16:29:23Z

Marcoslee: /* By Date */


Marcos Ki Hyung Lee (Work Log)

2018-07-11T16:25:56Z

Marcoslee: /* By Date */


Marcos Ki Hyung Lee (Work Log)

2018-07-04T17:49:42Z

Marcoslee:

==Summer 2018==

'''2018-07-04:'''

Rework data analysis following suggestions from Egan and Fox.

'''2018-07-03:'''

Worked with regressions and made log files with summary statistics and outputs.

Had Skype meetings with Ed Egan and Jeremy Fox.

'''2018-06-26:'''

Picked relevant variables and started thinking of some regression specifications.

'''2018-06-25:'''

Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the nonsynthetic ones: there is double counting when a VC met more than one portco on the same day AND those portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (from before meeting the portco) are mostly 0 or -1, while the all-time ones have lots of missing values. Initially I thought it was an error in the code, but after thinking about it, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with real matches.
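
To make the two counting problems in this entry concrete, here is a minimal Python sketch. It is not the project's SQL: the row layout, the names <code>REAL</code>, <code>count_minus_one</code> and <code>count_strictly_before</code>, and the assumption that the old logic counts rows dated up to and including the match date are all hypothetical, chosen only to reproduce the symptoms described (an overcount for same-day, same-industry portcos, and -1 for synthetic matches).

```python
from datetime import date

# Hypothetical rows of real investments: (vc, portco, date, industry_code).
# "b" and "d" share a date and an industry code on purpose.
REAL = [
    ("vc1", "a", date(2000, 1, 1), 10),
    ("vc1", "b", date(2001, 1, 1), 10),
    ("vc1", "d", date(2001, 1, 1), 10),
]


def count_minus_one(vc, when, code, rows=REAL):
    """Assumed old logic: sum same-industry dummies up to `when`, then subtract 1."""
    return sum(1 for v, _, d, c in rows if v == vc and c == code and d <= when) - 1


def count_strictly_before(vc, portco, when, code, rows=REAL):
    """Alternative: drop the match itself by id and count strictly earlier dates only."""
    return sum(
        1
        for v, p, d, c in rows
        if v == vc and p != portco and c == code and d < when
    )


# Real match (vc1, b): only "a" is a previous same-industry portco.
real_old = count_minus_one("vc1", date(2001, 1, 1), 10)  # a, b, d counted, minus 1
real_new = count_strictly_before("vc1", "b", date(2001, 1, 1), 10)

# Synthetic match (vc1, x), dated before any real investment by vc1.
syn_old = count_minus_one("vc1", date(1999, 6, 1), 10)  # nothing counted, minus 1
syn_new = count_strictly_before("vc1", "x", date(1999, 6, 1), 10)

print(real_old, real_new, syn_old, syn_new)
```

Under these assumptions, excluding the matched portco by id and using a strict date comparison avoids both the fixed "minus 1" adjustment and the negative values, since a synthetic portco never appears among the VC's real investments.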

'''2018-06-22:'''

Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and lots of missing values.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.
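
The corrected coinvestor count can be illustrated with a short Python sketch. This is an illustration only, not the SQL we ran: the function name <code>num_coinvestors</code> and the (portco, fund) input layout are assumptions. The point is that funds are de-duplicated per PortCo before counting, so a fund that shows up in several rounds is counted once.

```python
from collections import defaultdict


def num_coinvestors(rounds):
    """Map each portco to the number of distinct funds invested in it, minus one."""
    funds_by_portco = defaultdict(set)
    for portco, fund in rounds:
        funds_by_portco[portco].add(fund)  # a set ignores repeated funds
    return {portco: len(funds) - 1 for portco, funds in funds_by_portco.items()}


# Hypothetical (portco, fund) pairs, one per round participation.
rounds = [("p1", "f1"), ("p1", "f1"), ("p2", "f1"), ("p2", "f2"), ("p2", "f2")]
counts = num_coinvestors(rounds)
print(counts)
```

Here p1 has a single fund investing in two rounds, so its coinvestor count is 0, matching the rule stated above.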

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos that VCs invested in that were in the same industry code as them, before meeting the current matched PortCo (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, they are mostly -1 for the sum ones and missing for the tot ones.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.
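
The properties these variables should satisfy (non-negative integers, not missing, tot >= sum) can be turned into a quick sanity check. The sketch below is a hypothetical helper in Python, not part of the do-files; <code>check_counts</code> and the use of <code>None</code> for a missing value are assumptions.

```python
def check_counts(sum_prev, tot):
    """Return the list of problems for one (synsum*, syntot*) pair; None means missing."""
    problems = []
    if sum_prev is None:
        problems.append("sum missing")
    elif sum_prev < 0:
        problems.append("sum negative")
    if tot is None:
        problems.append("tot missing")
    elif sum_prev is not None and tot < sum_prev:
        problems.append("tot < sum")
    return problems


# Hypothetical rows: a clean pair, the pattern seen in the synthetic data,
# and a pair that violates tot >= sum.
rows = [(0, 3), (-1, None), (2, 1)]
report = [check_counts(s, t) for s, t in rows]
print(report)
```

Running a check like this on each synsum*/syntot* pair would give a count of how many rows show each symptom before and after any fix to the SQL.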

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-07-03T16:08:31Z

Marcoslee:

==Summer 2018==

'''2018-07-03:'''

Worked with regressions and made log files with summary statistics and outputs.

'''2018-06-26:'''

Picked relevant variables and started thinking of some regression specifications.

'''2018-06-25:'''

Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the nonsynthetic ones: there is double counting when a VC met more than one portco on the same day AND those portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (from before meeting the portco) are mostly 0 or -1, while the all-time ones have lots of missing values. Initially I thought it was an error in the code, but after thinking about it, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with real matches.

'''2018-06-22:'''

Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and lots of missing values.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos that VCs invested in that were in the same industry code as them, before meeting the current matched PortCo (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, they are mostly -1 for the sum ones and missing for the tot ones.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-26T20:52:11Z

Marcoslee:

==Summer 2018==

'''2018-06-26:'''

Picked relevant variables and started thinking of some regression specifications.

'''2018-06-25:'''

Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the nonsynthetic ones: there is double counting when a VC met more than one portco on the same day AND those portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (from before meeting the portco) are mostly 0 or -1, while the all-time ones have lots of missing values. Initially I thought it was an error in the code, but after thinking about it, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with real matches.

'''2018-06-22:'''

Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and lots of missing values.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos that VCs invested in that were in the same industry code as them, before meeting the current matched PortCo (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, they are mostly -1 for the sum ones and missing for the tot ones.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-25T20:31:26Z

Marcoslee:

==Summer 2018==

'''2018-06-25:'''

Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the nonsynthetic ones: there is double counting when a VC met more than one portco on the same day AND those portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (from before meeting the portco) are mostly 0 or -1, while the all-time ones have lots of missing values. Initially I thought it was an error in the code, but after thinking about it, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with real matches.

'''2018-06-22:'''

Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and lots of missing values.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos that VCs invested in that were in the same industry code as them, before meeting the current matched PortCo (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, they are mostly -1 for the sum ones and missing for the tot ones.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee

2018-06-25T17:13:00Z

Marcoslee:

File:Marcosleepic.jpg

2018-06-25T17:12:40Z

Marcoslee:

Marcos Ki Hyung Lee (Work Log)

2018-06-22T22:30:38Z

Marcoslee:

==Summer 2018==

'''2018-06-22:'''

Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and lots of missing values.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos that VCs invested in that were in the same industry code as them, before meeting the current matched PortCo (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, they are mostly -1 for the sum ones and missing for the tot ones.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-22T16:20:03Z

Marcoslee:

Marcos Ki Hyung Lee (Work Log)

2018-06-22T15:16:55Z

Marcoslee:

==Summer 2018==

'''2018-06-22:'''

Plans for today: try to fix dataset.

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-21T22:23:10Z

Marcoslee: /* Summer 2018 */

==Summer 2018==

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-21T15:27:08Z

Marcoslee:

==Summer 2018==

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-21T15:26:51Z

Marcoslee: /* Summer 2018 */

==Summer 2018==

'''2018-06-21:'''

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

'''2018-06-20:''' Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Marcos Ki Hyung Lee (Work Log)

2018-06-20T21:38:43Z

Marcoslee: Created page with "==Summer 2018== 2018-06-20: Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile."

==Summer 2018==

2018-06-20: Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

VC Startup Matching Stata Work

2018-06-20T20:42:45Z

Marcoslee:

VC Startup Matching Stata Work

2018-06-20T19:47:16Z

Marcoslee:

VC Startup Matching Stata Work

2018-06-20T19:45:23Z

Marcoslee:

VC Startup Matching Stata Work

2018-06-20T19:40:26Z

Marcoslee: Created page with "{{McNair Projects |Has title=VC Startup Matching Stata Work |Has owner=Marcos Ki Hyung Lee, |Has start date=06/2018 |Has keywords=VC, Stata, Matching, Startup |Has project sta..."

Work Logs

2018-06-20T15:14:58Z

Marcoslee:

[[Category: McNair Admin]]

Work Logs are broken down into two divisions of McNair Center, the long-term deliverables of academic papers and short-term deliverables of general content. Individuals working within a division will be listed under the respective one. In case an individual works within both divisions, they will be listed in both locations.

=Academic Papers=

This division of the McNair Center pursues longer-term projects, such as peer-reviewed academic papers.

==Jake Silberman==
{{:Jake Silberman (Work Log)}}

==Will Cleland==
{{:Will Cleland (Work Log)}}

==Todd Rachowin==
{{:Todd Rachowin (Work Log)}}

==Amir Kazempour==
{{:Amir Kazempour (Work Log)}}

==Marcos Ki Hyung Lee==
{{:Marcos Ki Hyung Lee (Work Log)}}

=Research=

==Amir Kazempour==
{{:Amir Kazempour (Work Log)}}

==Ben Baldazo==
{{:Ben Baldazo (Work Log)}}

==Catherine Kirby==
{{:Catherine Kirby (Work Log)}}

==Connor Rothschild==
{{:Connor Rothschild (Work Log)}}

==Diana Carranza==
{{:Diana Carranza (Work Log)}}

==Dylan Dickens==
{{:Dylan Dickens (Work Log)}}

==Hira Farooqi==
{{:Hira Farooqi (Work Log)}}

==James Chen==
{{:James Chen (Work Log)}}

==Joe Reilly==
{{:Joe Reilly (Work Log)}}

==Julia Wang==
{{:Julia Wang (Work Log)}}

==Matthew Ringheanu==
{{:Matthew Ringheanu (Work Log)}}

==Meghana Gaur==
{{:Meghana Gaur (Work Log)}}

==Shrey Agarwal==
{{:Shrey Agarwal (Work Log)}}

==Taylor Jacobe==
{{:Tay Jacobe (Work Log)}}

==Yunnie Huang==
{{:Yunnie Huang (Work Log)}}

=Technical=

==Christy Warden==
{{:Christy Warden (Work Log)}}

==Harrison Brown==
{{:Harrison Brown (Work Log)}}

==Minh Le==
{{:Minh Le (Work Log)}}

==Jeemin Sim==
{{:Jeemin Sim (Work Log)}}

==Kyran Adams==
{{:Kyran Adams (Work Log)}}

==Oliver Chang==
{{:Oliver Chang (Work Log)}}

==Peter Jalbert==
{{:Peter Jalbert (Work Log)}}

==Shelby Bice==
{{:Shelby Bice (Work Log)}}

==Yang Zhang==
{{:Yang Zhang (Work Log)}}

=Administrative=

==Cindy Ryoo==
{{:Cindy Ryoo (Work Log)}}

==Lin Yang==
{{:Lin Yang (Work Log)}}

==Michelle Huang==
{{:Michelle Huang (Work Log)}}

=Archive=

This is the work log for archived members.

==Abhijit Brahme==
{{:Abhijit Brahme (Work Log)}}

==Adrian Smart==
{{:Adrian Smart (Work Log)}}

==Albert Nabiullin==
{{:Albert Nabiullin (Work Log)}}

==Alex Jiang==
{{:Alex Jiang (Work Log)}}

==Ariel Sun==
{{:Ariel Sun (Work Log)}}

==Avesh Krishna==
{{:Avesh Krishna (Work Log)}}

==Carlin Cherry==
{{:Carlin Cherry (Work Log)}}

==Claudio Sanchez-Nieto==
{{:Claudio Sanchez-Nieto (Work Log)}}

==Dan Lee==
{{:Dan Lee (Work Log)}}

==David Zhang==
{{:David Zhang (Work Log)}}

==Eliza Martin==
{{:Eliza Martin (Work Log)}}

==Gunny Liu==
{{:Gunny Liu (Work Log)}}

==Harsh Upadhyay==
{{:Harsh Upadhyay (Work Log)}}

==Iris Huang==
{{:Iris Huang (Work Log)}}

==Jackie Li==
{{:Jackie Li (Work Log)}}

==Jake Floyd==
{{:Jake Floyd (Work Log)}}

==Jason Isaacs==
{{:Jason Isaacs (Work Log)}}

==Juliette Richert==
{{:Juliette Richert (Work Log)}}

==Kerda Veraku==
{{:Kerda Veraku (Work Log)}}

==Komal Agarwal==
{{:Komal Agarwal (Work Log)}}

==Kranthi Pandiri==
{{:Kranthi Pandiri (Work Log)}}

==Kunal Shah==
{{:Kunal Shah (Work Log)}}

==Lauren Bass==
{{:Lauren Bass (Work Log)}}

==Leo Du==
{{:Leo Du (Work Log)}}

==Mallika Miglani==
{{:Mallika Miglani (Work Log)}}

==Marcela Interiano==
{{:Marcela Interiano (Work Log)}}

==Meghana Pannala==
{{: Meghana Pannala (Work Log)}}

==Napas Udomsak==
{{: Napas Udomsak (Work Log)}}

==Pedro Alvarez==
{{:Pedro Alvarez (Work Log)}}

==Rachel Garber==
{{:Rachel Garber (Work Log)}}

==Ramee Saleh==
{{:Ramee Saleh (Work Log)}}

==Ravali Kruthiventi==
{{:Ravali Kruthiventi (Work Log)}}

==Sahil Patnayakuni==
{{:Sahil Patnayakuni (Work Log)}}

==Shoeb Mohammed==
{{:Shoeb Mohammed (Work Log)}}

==Sonia Zhang==
{{:Sonia Zhang (Work Log)}}

==Su Chen Teh==
{{:Su Chen Teh (Work Log)}}

==Todd Rachowin==
{{:Todd Rachowin (Work Log)}}

==Veeral Shah==
{{:Veeral Shah (Work Log)}}

==Will Cleland==
{{:Will Cleland (Work Log)}}

==Yael Hochberg==
{{:Yael Hochberg (Work Log)}}

==Yimeng Tang==
{{:Yimeng Tang (Work Log)}}

Marcos Ki Hyung Lee

2018-06-20T15:08:48Z

Marcoslee:

{{McNair Staff
|position=Research Team
|name=Marcos Ki Hyung Lee
|user_image=mypic.jpeg
|degree=Ph.D.
|major=Economics
|class=2022
|join_date=07/2018
|skills=Econometrics,
|interests=Underground urban sub-cultures, Hiking, Jazz, Live Music, Math, Postmodernism in literature, Movies,
|email=marcos.lee@rice.edu
|status=Active
}}
==Summer 2018==
[[Marcos Ki Hyung Lee (Work Log)]]

Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists

2018-06-19T21:12:35Z

Marcoslee:

{{AcademicPaper
|Has title=Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists
|Has author=Ed Egan, Jeremy Fox, David Hsu, Chenyu Yang,
|Has RAs=Meghana Gaur, James Chen, Kyran Adams, Marcos Ki Hyung Lee,
|Has paper status=Draft
}}
==Reference Papers==

Jeremy's paper with David Hsu and Chenyu Yang is here: [http://fox.web.rice.edu/working-papers/fox-hsu-yang-matching.pdf Unobserved Heterogeneity in Matching Games with an Application to Venture Capital].

'''Abstract:''' Agents in two-sided matching games vary in characteristics that are unobservable in typical data
on matching markets. We investigate the identification of the distribution of unobserved characteristics
using data on who matches with whom. In full generality, we consider many-to-many
matching and matching with trades. The distribution of match-specific unobservables cannot be
fully recovered without information on unmatched agents, but the distribution of a combination
of unobservables, which we call unobserved complementarities, can be identified. Using data on
unmatched agents restores identification. We estimate the contribution of observables and unobservable
complementarities to match production in venture capital investments in biotechnology
and medical firms.

[[Fox Hsu Yang (2015) - Unobserverd Heterogeneity in Matching Games with an Application to Venture Capital]] provides some notes.

==Matlab Code==

[[Abhijit Brahme (Work Log)]] contains his notes on working with the Matlab code. There is a separate page here: [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]].

==Data specification==

The data spec sent to Jeremy is in:
Z:\Projects\MatchingAcceleratorsToVCs

==Data foundations==

The database is '''vcdb2'''

The foundational tables were built using:
Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql

The documentation, which is a little messy, is on [[VC Database Rebuild]]

Our SQL script, which builds on top of the above database (still in vcdb2) is in:
E:\McNair\Projects\MatchingEntrepsToVC\DataWork

==Dataset build==

===Decisions===

Decisions we need to make:
*Will we need synthetic matches? If so, what do we do for outcomes? We can still do dyadic and left/right pair variables.
*Granularity of industry: To start let's use minor industry group (see below). We use a much finer grained industry definition and aggregate back up to balance out the counts somewhat later.
*Matching to a fund or a firm: For now, we will work with funds. Deals are sometimes transferred across funds within a firm (e.g. from Kleiner fund IV to Kleiner fund V), but this is probably comparatively rare (check!).
*Dealing with the right-censoring problem: We can likely address this with indicator variables to condition on, but may want to restrict estimation to dyads that don't have this issue. For now we will take portfolio companies that received their last investment before 2007, to allow funds a full 10 years to clear their portfolios.
*Inadequate coverage in early years: VentureXpert's coverage is notably inferior prior to 1982. We should start with portcos that received their first investment in 1985 and forward.
*Determination of lead VC - see below
*How to collapse VC rounds (date, amount, etc.): We will use only seed, early, later stage investment and insist on the presence of seed/early for inclusion. We can then have date first, investment duration (to date last), total investment.
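
The round-collapsing decision in the last bullet can be sketched in Python as follows. This is an illustration under assumptions, not the SQL that builds the dataset: the stage labels <code>"Seed"</code>, <code>"Early"</code>, <code>"Later"</code>, the (date, stage, amount) layout, and the function name <code>collapse_rounds</code> are hypothetical.

```python
from datetime import date

SEL = {"Seed", "Early", "Later"}  # assumed stage labels


def collapse_rounds(rounds):
    """Collapse one portco's (date, stage, amount) rounds into a single record.

    Only seed/early/later rounds are used; returns None when the portco
    has no seed or early round at all.
    """
    sel = [r for r in rounds if r[1] in SEL]
    if not any(stage in {"Seed", "Early"} for _, stage, _ in sel):
        return None
    first = min(d for d, _, _ in sel)
    last = max(d for d, _, _ in sel)
    return {
        "date_first": first,
        "date_last": last,
        "duration_years": (last - first).days / 365.25,
        "total_invested": sum(amount for _, _, amount in sel),
    }


# Hypothetical portco with one non-SEL round that should be ignored.
collapsed = collapse_rounds([
    (date(2000, 1, 1), "Seed", 1.0),
    (date(2001, 6, 1), "Buyout", 50.0),
    (date(2002, 1, 1), "Later", 4.0),
])
late_only = collapse_rounds([(date(2003, 1, 1), "Later", 2.0)])
print(collapsed, late_only)
```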

===Objective dataset description===

Unit of observation - a startup-fund match.

Constraints:
*PortCo name disclosed
*PortCo date of first investment >= 1/1/1985
*PortCo date of last investment <= 2007 to allow 10 yrs for the funds
*PortCo received at least one round of Seed or Early stage investment
*Matched VC is not undisclosed
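
As a rough illustration, the constraints above could be expressed as a single filter over candidate matches. The sketch below is Python, not the project's SQL, and every field name, the <code>"Undisclosed"</code> placeholder, and the reading of "<= 2007" as a comparison on the calendar year are assumptions.

```python
from datetime import date

UNDISCLOSED = "Undisclosed"  # assumed placeholder for undisclosed names


def keep_match(m):
    """Apply the five sample constraints to one startup-fund match (a dict)."""
    return (
        m["portco_name"] != UNDISCLOSED
        and m["first_inv"] >= date(1985, 1, 1)
        and m["last_inv"].year <= 2007
        and bool({"Seed", "Early"} & set(m["stages"]))
        and m["fund_name"] != UNDISCLOSED
    )


# Hypothetical records: one that passes, and variants that each break one rule.
good = {
    "portco_name": "Acme Bio",
    "first_inv": date(1990, 5, 1),
    "last_inv": date(1996, 3, 1),
    "stages": ["Seed", "Later"],
    "fund_name": "Fund I",
}
too_early = dict(good, first_inv=date(1984, 12, 31))
no_seed_early = dict(good, stages=["Later"])
hidden_vc = dict(good, fund_name=UNDISCLOSED)
kept = [keep_match(m) for m in (good, too_early, no_seed_early, hidden_vc)]
print(kept)
```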

Variables:

Startup:
*PortCo ID
*PortCo Name
*Longitude, latitude,
*State of inc., industry, year of founding, year of first investment, year of last investment
*SEL $invested, SEL num rounds, transactional VC indicator and $inv, investment duration SEL (yrs)
*Exit indicator, exit value, exit type indicator
*alive2016 indicator, last round pre-2012 indicator
*total MOOMI (Money Out Over Money In)

Fund:
*fund ID
*fund name
*Number of funds investing (SEL)
*As averages (?) and for lead:
**Fund ipo count, Fund M&A count, Fund investment count(calc at end), fund ipo rate, fund M&A rate, fund exit count, fund exit rate, fund ipo $, fund M&A $, fund exit $, fund fraction of MOOMI.
*Total invested by lead, number of rounds participation by lead, stage of participation of lead, location of lead, last investment pre-2012 indicator, lead fund type indicator (corp, priv, gov, etc.), lead fund size, lead fund vintage year.

Dyadic variables:
*Distance between lead and portco,
*industry preference match between lead and portco
*maybe stage-match (doesn't make a lot of sense when collapsing rounds) between lead and port co.
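
For the distance variable, one standard option is the great-circle distance computed with the haversine formula from the longitude and latitude fields. The Python sketch below is a generic implementation, not code from the project; the function name <code>great_circle_km</code> and the mean Earth radius of 6371 km are choices made here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


same_point = great_circle_km(29.7, -95.4, 29.7, -95.4)
quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
print(same_point, quarter_equator)
```

A quarter of the equator is a convenient check, since the result should equal the radius times pi/2.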

===Identifying lead VCs===

Possible methods:
*Best performing participant (on exit count/value or fractional MOOMI) with tie-breaker
*Closest participant (using great circle distance)
*Most frequent participant with tie-breaker
*Participant with greatest investment with tie-breaker
*Participant in earliest round that stayed in for longest with tie-breaker
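
As an example of how one of these rules might be implemented, the Python sketch below picks the participant with the greatest total investment. The tie-breaker is not specified above, so the one used here (earliest first round, then fund id) is purely an assumption, as are the name <code>lead_by_investment</code> and the (fund_id, round_date, amount) layout.

```python
from collections import defaultdict
from datetime import date


def lead_by_investment(participations):
    """Pick the lead fund from (fund_id, round_date, amount) rows for one portco.

    Greatest total investment wins; ties go to the fund with the earliest
    first round, then to the smallest fund id (assumed tie-breakers).
    """
    total = defaultdict(float)
    first = {}
    for fund, when, amount in participations:
        total[fund] += amount
        first[fund] = min(when, first.get(fund, when))
    return min(total, key=lambda f: (-total[f], first[f], f))


# Hypothetical portco: f1 and f2 both invest 5.0 in total, f2 entered first.
lead = lead_by_investment([
    ("f1", date(2001, 1, 1), 2.0),
    ("f1", date(2002, 1, 1), 3.0),
    ("f2", date(2000, 6, 1), 5.0),
    ("f3", date(2000, 1, 1), 1.0),
])
print(lead)
```

The other candidate rules (closest, most frequent, best performing) would fit the same pattern by changing the sort key.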

===Minor Industry===

Across all time and without regard to SEL vs. transaction, here's the minor industry list and counts:

 indminorgroup                  | count
--------------------------------+-------
 Industrial/Energy              |  2871
 Internet Specific              |  8794
 Biotechnology                  |  2592
 Semiconductors/Other Elect.    |  2402
 Other Products                 |  4891
 Computer Hardware              |  2061
 Computer Software and Services | 10550
 Communications and Media       |  3271
 Medical/Health                 |  4373
 Consumer Related               |  3161

==Literature from David==

Literature to "validate" our sample. I think you probably know the papers I reference below (let me know if you need any of them-some for which I am coauthor you can get from my website).

#VCs are more likely to match with geographically proximate startups (Lerner on corporate governance, Sorenson on geography)
#Startups prefer to match with VCs with domain experience within their startup sector (Morten Sorensen), possibly also prefer to match by stage of VC specialization relative to their own stage of development (not sure which paper if any documents that)
#Startup patents signal VCs (Hsu/Ziedonis in SMJ)
#VCs prefer serial founders, or at least may interact differently with founders based on their prior founding experience (Hsu 2007 in Research Policy)
#If we have access to more individual data: VCs prefer to invest in founders with similar demographic characteristics relative to their own characteristics (Gompers et al within the past few years in JFE, Bengtsson and Hsu in JBV within the last few years).

==Work Done in Late November by Dylan & Ed==

SBIR data was taken from McNair\Projects\SBIR\Data\Aggregate SBIR\SBIR.txt. Note: this file needed to be opened in Excel to be readable, and took a very long time to open due to its large size. SBIR firm names were put into a pivot table to eliminate exact repeat entries, and then exported to a txt file, NSBIR. NSBIR was then matched using The Matcher in mode 2 with the following code:
"-file1="NSBIR.txt" -file2="NSBIR.txt" -mode=2"

Output then placed in:
McNair\Projects\MatchingEntrepsToVC\Matching\Output

The original pre-matched, cleaned NSBIR.txt file is moved to:
McNair\Projects\MatchingEntrepsToVC\Matching\Input.

There is a sql file to extract VC portcos (SEL backed only), with key info from vcdb2, and distinct assignee names from allpatentsprocessed here:
E:\McNair\Projects\MatchingEntrepsToVC\Matching

There are three input files:
*distinctNSBIR.txt - made by pivot tabling SBIR.txt from the SBIR aggregation project
*distinctassignees.txt - extracted as distinct from allpatentsprocessed
*vcbackedselcokeys.txt - extracted with key info from vcdb2. It needs pivot tabling to get unique names.

These .txt files were made distinct, and then matched against themselves for normalization. The normalized files still need to be matched against each other. They are located in:
McNair\Projects\MatchingEntrepsToVC\Matching\Normalized

These normalized files were then matched against each other, giving approximately 12,000 matches. They are located in:
McNair\Projects\MatchingEntrepsToVC\Matching\Normalized & Matched

Marcos Ki Hyung Lee

2018-06-18T15:47:16Z

Marcoslee:

{{McNair Staff
|position=Research Team
|name=Marcos Ki Hyung Lee
|user_image=mypic.jpeg
|degree=Ph.D.
|major=Economics
|class=2022
|join_date=07/2018
|skills=Econometrics,
|interests=Underground urban sub-cultures,
|email=marcos.lee@rice.edu
|status=Active
}}

==Summer 2018==
[[Marcos Ki Hyung Lee (Work Log)]]

User:Marcoslee

2018-06-18T15:40:40Z

Marcoslee: Redirected page to Marcos Ki Hyung Lee

#REDIRECT [[Marcos Ki Hyung Lee]]

Economics PhD Student at Rice University. Interests in Labor Economics, and Applied Microeconomics in general.

BA and MA in Economics from the University of Sao Paulo.


Marcos Ki Hyung Lee

2018-06-18T15:32:18Z

Marcoslee:

Marcos Ki Hyung Lee

2018-06-18T15:31:23Z

Marcoslee: Created page with "{{McNair Staff |position=Research Team |name=Marcos Ki Hyung Lee |user_image=mypic.jpeg |degree=Ph.D. |major=Economics |class=2022 |join_date=07/2018 |skills=Econometrics, |in..."