Marcos Ki Hyung Lee (Work Log)


Summer 2018

Notes from Ed

Please build and link to a project page for the STATA analysis!

Also, if/when you make changes to a sql file, please:

  1. Run them through or make it clear that you haven't with comments
  2. Let me know by posting it on a project page and linking to it in your work log.

Otherwise, we are both going to be making conflicting changes to the same files.

By Date

Project Page: VC Startup Matching Stata Work

2018-07-23 until 07-27:

This week was dedicated to refining the Linear Probability Model and the reduced form evidence that I did for Jeremy.

As Jeremy requested, I added extensive notes on the interpretation of the coefficients of each estimated model. I also fixed small technical issues in each model.

Additionally, after a call with Ed, I checked the distribution of market size when using either year-code100 or year-code20 as the market definition. See the Project Page for more on this.

2018-07-11 until 07-20:

Basically spent this entire week working out the code to build the LPM dataset. Detailed description of code and dataset on project page.


Wrote new SQL code to build the LPM dataset. Sent it to Ed Egan to check.


Sick day.


Skype meeting with Ed Egan to discuss the new dataset. We need to build a broader dataset for running the LPM. Started studying the SQL code and thinking about the changes needed to get the desired dataset.


Investigated why the LPM is not giving the expected results.


Received new suggestions from Fox.

Started rewriting the report following comments from Fox.


Wrote a report with all the results so far at the request of Jeremy Fox and sent it to him.

Also added an LPM to the analysis, at Jeremy Fox's suggestion, although I suspect there is something wrong with the way I built the dataset needed for it. Emailed my concerns to Fox.

After meeting with Ed Egan, changed the SQL code that builds the VC history variables. We no longer subtract 1 from the sum over all portcos that worked with the VC, and the LEFT JOIN now uses a strict inequality instead of a weak one.
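The effect of this change can be sketched on a toy table. This is only an illustration, not the project's SQL: the table `matches`, its columns, and the variable name `sumprev` are hypothetical, and it runs in SQLite through Python rather than in the project database. It assumes the join condition compares match dates. The old form (weak inequality, then subtract 1) overcounts when a VC meets two portcos with the same industry code on the same day, while the strict inequality counts only earlier portcos.

```python
import sqlite3

# Hypothetical toy data: one row per (VC, portco) match. Not the project's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matches (vc TEXT, portco TEXT, code100 INTEGER, matchdate TEXT);
INSERT INTO matches VALUES
  ('vc1', 'a', 10, '2001-01-01'),
  ('vc1', 'b', 10, '2002-01-01'),
  ('vc1', 'c', 10, '2002-01-01'),  -- same day and same code as 'b'
  ('vc1', 'd', 20, '2003-01-01');
""")

# Old approach: weak inequality in the join, then subtract 1 for the self-match.
OLD = """
SELECT A.portco,
       SUM(CASE WHEN A.code100 = B.code100 THEN 1 ELSE 0 END) - 1 AS sumprev
FROM matches AS A
LEFT JOIN matches AS B ON A.vc = B.vc AND B.matchdate <= A.matchdate
GROUP BY A.vc, A.portco
ORDER BY A.portco
"""

# New approach: strict inequality, so the row never joins to itself
# (or to same-day matches) and nothing needs to be subtracted.
NEW = """
SELECT A.portco,
       SUM(CASE WHEN A.code100 = B.code100 THEN 1 ELSE 0 END) AS sumprev
FROM matches AS A
LEFT JOIN matches AS B ON A.vc = B.vc AND B.matchdate < A.matchdate
GROUP BY A.vc, A.portco
ORDER BY A.portco
"""

old = dict(conn.execute(OLD).fetchall())
new = dict(conn.execute(NEW).fetchall())
print("old:", old)
print("new:", new)
```

In this toy example, portcos 'b' and 'c' each have exactly one earlier same-code portco ('a'). The old query reports 2 for both, and the new query reports 1.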


Reworked the data analysis following suggestions from Egan and Fox.


Worked with regressions and made log files with summary statistics and outputs.

Had Skype meetings with Ed Egan and Jeremy Fox.


Picked relevant variables and started thinking of some regression specifications.


Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the non-synthetic ones: there is some double counting when a VC met with more than one portco on the same day AND the portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate the same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (prior to meeting the portco) are mostly 0 or -1, while the all-time ones have many missing values. Initially I thought it was an error in the code, but after thinking about this, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with real matches.


Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and many missing values.

Also, variables from matched firms, such as number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.

Looking into the synthetic variables problem, the main issue is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.
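One way a coinvestor count can avoid double counting funds is to count distinct funds per PortCo and subtract the focal fund. The sketch below is an assumption about the general technique, not the project's SQL: the table `rounds`, its columns, and the idea that duplicates come from a fund appearing in several rounds are all hypothetical, and it uses SQLite through Python.

```python
import sqlite3

# Hypothetical toy data: one row per (portco, fund, round). Not the project's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rounds (portco TEXT, fund TEXT, roundno INTEGER);
INSERT INTO rounds VALUES
  ('p1', 'f1', 1),
  ('p1', 'f1', 2),  -- f1 invests again in round 2
  ('p1', 'f2', 2),
  ('p2', 'f3', 1),
  ('p2', 'f3', 2);  -- p2 has a single fund across two rounds
""")

QUERY = """
SELECT portco,
       COUNT(*) - 1             AS naive,         -- counts repeat rounds as extra funds
       COUNT(DISTINCT fund) - 1 AS numcoinvestor  -- each fund counted once
FROM rounds
GROUP BY portco
ORDER BY portco
"""

counts = {portco: (naive, n) for portco, naive, n in conn.execute(QUERY)}
print(counts)
```

With distinct counting, a PortCo with a single VC fund ('p2' here) gets numcoinvestor == 0, as the entry above describes.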

They count the number of PortCos a VC invested in that had the same industry code as the current matched PortCo, both before meeting that PortCo (synsum*) and over all time (syntot*). So they should be non-negative integers with tot >= sum. However, for the synthetic firms, the sum variables are mostly -1 and the tot variables are mostly missing.

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.
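The consistency conditions described here (non-missing, non-negative, and tot >= sum) can be checked mechanically. The following is a minimal sketch in plain Python. The function name and the example rows are hypothetical, and only the variable names come from the dataset.

```python
def check_history_counts(rows, sum_var, tot_var):
    """Count rows that violate the expected properties of a sum/tot pair."""
    problems = {"missing": 0, "negative": 0, "tot_lt_sum": 0}
    for row in rows:
        s, t = row.get(sum_var), row.get(tot_var)
        if s is None or t is None:
            problems["missing"] += 1
            continue  # cannot compare when either value is missing
        if s < 0 or t < 0:
            problems["negative"] += 1
        if t < s:
            problems["tot_lt_sum"] += 1
    return problems


# Made-up rows for illustration only, not values from the matched dataset.
rows = [
    {"synsumprevsameindu100": 0, "syntotsameindu100": 2},      # consistent
    {"synsumprevsameindu100": -1, "syntotsameindu100": None},  # missing tot
    {"synsumprevsameindu100": -1, "syntotsameindu100": 0},     # negative sum
    {"synsumprevsameindu100": 3, "syntotsameindu100": 1},      # tot < sum
]
report = check_history_counts(rows, "synsumprevsameindu100", "syntotsameindu100")
print(report)
```

Running the same check on each sum/tot pair (code100, code20, indu, sector) would show which variables are affected and how badly.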


Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be any inconsistencies there.

2018-06-20: Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.