{{Project|Has project output=Tool|Has sponsor=McNair Center
|Has title=VC Startup Matching Stata Work
|Has owner=Marcos Ki Hyung Lee,
===Linear Probability Model===
 
'''THIS IS OBSOLETE AS OF NEW SQL CODE'''
A linear probability model was suggested by Jeremy Fox: Y=1 when the match is real, Y=0 when the match is synthetic, and the independent variables are characteristics of the VCs.
To run the linear probability model, we need to build a new dataset. This was partially done in the Stata do-file explained above, but doing it in SQL gives us more flexibility when choosing the synthetic match.
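The model above can be sketched in a few lines. This is a minimal, illustrative example with made-up numbers (the project's actual data and covariates are not shown here): the 0/1 match indicator is regressed on a single hypothetical VC characteristic with closed-form simple OLS, and the fitted values are read as match probabilities.

```python
# Toy sketch of a linear probability model: y is 1 for a real match and
# 0 for a synthetic one; x stands in for a VC characteristic (illustrative
# numbers only, not from the project's dataset).

def ols_simple(x, y):
    """Return (intercept, slope) from a closed-form simple least-squares fit."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

x = [10, 50, 200, 400, 30, 120]   # hypothetical VC characteristic
y = [0, 0, 1, 1, 0, 1]            # 1 = real match, 0 = synthetic match
intercept, slope = ols_simple(x, y)

# In an LPM, fitted values are interpreted directly as match probabilities.
p_hat = [intercept + slope * xi for xi in x]
```

In practice the project runs this in Stata with many covariates; the sketch only shows the structure of the dependent variable and its interpretation.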
The end result is a table that lists all matches that could have occurred in every possible market, including the real one. First, we need to define exactly what a market is. Here, a market consists of all matches that occurred in a given year and within an industry sector, usually defined by a code. The size and type of market therefore hinge on which industry code is being used.

There are three categories, each more granular than the last, that define a startup's industry. The broadest is the industry Class, with three categories: 'Information Technology', 'Medical/Health/Life Science', and 'Non-High Tech'. Next is the Minor group, with categories such as Communications and Media, Computer Hardware, Biotech, or Consumer Related. Finally, the finest is the Subgroup, which gets very specific, like Wireless Communication Services or Medical Imaging. An industry code is then a 4-digit number, where the first digit gives the industry Class, the second the Minor group, and the last two the Subgroup.

We aggregate Subgroups with fewer than 20 observations (i.e., number of startups) into an 'Other' category to create 'code20', and analogously create 'code100' for Subgroups with fewer than 100 observations. We want to create a table that lists, for each unique portco, all the firms in its eligible market, i.e., firms active in the year the portco received its first investment from the real matched VC and that had invested in a portco of the same code100/code20 in that year.
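The code20 aggregation described above can be sketched as follows. This is an illustrative sketch, not the project's SQL: subgroup codes seen fewer than `threshold` times are collapsed into an 'Other' bucket, and code100 is the same idea with a threshold of 100.

```python
# Collapse rare industry Subgroup codes into 'Other' (the code20/code100 idea).
from collections import Counter

def aggregate_codes(subgroups, threshold=20):
    """Map each 4-digit industry code to itself, or to 'Other' if it occurs
    fewer than `threshold` times in the data."""
    counts = Counter(subgroups)
    return [code if counts[code] >= threshold else "Other" for code in subgroups]

# Hypothetical codes: '1101' is common, '2203' is rare.
codes = ["1101"] * 25 + ["2203"] * 5
code20 = aggregate_codes(codes, threshold=20)
```

The actual work is done in SQL over the startup table, but the grouping logic is the same.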
After that, we can simply append (union) the real-match table and calculate the variables from the original dataset on this new table.
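The append/union step can be sketched with an in-memory SQLite database. Table and column names here are illustrative assumptions, not the project's actual schema: real matches (y=1) and synthetic matches (y=0) are stacked into one table ready for the LPM.

```python
# Hedged sketch of the union step; schema and names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE real_matches (portco TEXT, vc TEXT, y INTEGER)")
cur.execute("CREATE TABLE synthetic_matches (portco TEXT, vc TEXT, y INTEGER)")
cur.execute("INSERT INTO real_matches VALUES ('acme', 'fund_a', 1)")
cur.execute("INSERT INTO synthetic_matches VALUES ('acme', 'fund_b', 0)")

# Stack real (y=1) and synthetic (y=0) matches into one LPM table.
cur.execute("""
    CREATE TABLE lpm AS
    SELECT portco, vc, y FROM real_matches
    UNION ALL
    SELECT portco, vc, y FROM synthetic_matches
""")
rows = cur.execute("SELECT portco, vc, y FROM lpm ORDER BY y DESC").fetchall()
```

UNION ALL (rather than UNION) is the right choice here because the real and synthetic rows are distinct by construction and no deduplication is wanted.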
The final code that does this is called 'CreatingLPM_withoutsyn.sql' when using code100, and 'CreatingLPM_withoutsyn_code20.sql' when using code20. Augi reworked and streamlined it. At the end of the code, we also create an LPM dataset directly, instead of having to do a manual reshape in Stata.

===Histograms===
The code 'Histograms.sql' exports two tables, 'DistribCode100.sql' and 'DistribCode20.sql', to Z:\VentureCapitalData\SDCVCData\vcdb2. After that, I import them into Excel and create histograms to characterize the distribution of market size. The Excel file is in E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex

==Reduced Form Analysis of the Dataset==
An extensive reduced-form analysis was carried out, with a lot of back-and-forth feedback between Jeremy Fox, Ed Egan, and me. I documented everything I did in a PDF file generated from LaTeX. Since converting it to wiki format would be too cumbersome, including converting multiple tables and figures, I've decided it is better to host the LaTeX files and PDF in the folder below:

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Pdf

Everything necessary to produce the PDF file is there. Open the .tex file 'regressions.tex' and build it using your preferred LaTeX compiler. A very easy option is to use an online compiler.

==Parallelization of Matlab Code==
This was done by [[Wei Wu]]. I will briefly summarize what he told me. His main objective was to parallelize Chenyu's Matlab code as much as possible. Apparently, this was done successfully. What changed is documented in [[Parallelize msmf corr coeff.m]]. He also had two other projects that did not end up working. One was to use the GPU to speed up the code even further; the reasons are well documented in [[Matlab, CUDA, and GPU Computing]]. Finally, he also tried expanding the parallelization by using NOTS (Night Owls Time-Sharing Service), a computing cluster. Since the parallelization was successful, expanding the number of cores available was the logical next step.
He ran into problems that I couldn't understand very well. Additionally, NOTS is not Windows-friendly. See [[NOTS Computing for Matching Entrepreneurs to VCs]] for more.
