Difference between revisions of "VCDB20"
Line 8: | Line 8: | ||
==Data design== | ==Data design== | ||
− | I followed the same data design as in [[VCDB20H1]]. Essentially the specification pulls everything, even things that aren't needed like incomplete M&As or withdrawn IPOs, or funds or investments that aren't venture capital, all the way to the present (the pull was done on 2020-01-04), and then place restrictions on the data later. Crucially, the pulls no longer use the venture-related flag, so the data contains private equity and other deals, as well as secondaries and purchases. Note that the M&As are pulled separately for public and private acquirers, and in chunks by year to keep the request sizes manageable. | + | I followed the same data design as in [[VCDB20H1]]. Essentially the specification pulls everything, even things that aren't needed like incomplete M&As or withdrawn IPOs, or funds or investments that aren't venture capital, all the way to the present (the pull was done on 2020-01-04), and then place restrictions on the data later. Crucially, the pulls no longer use the venture-related flag, so the data contains private equity and other deals, as well as secondaries and purchases. Note that the M&As are pulled separately for public and private acquirers, and in chunks by year to keep the request sizes manageable. To update these scripts change: |
+ | *The end date | ||
+ | *The file locations | ||
==Processing Steps== | ==Processing Steps== |
Revision as of 21:12, 4 January 2021
VCDB20 | |
---|---|
Project Information | |
Has title | VCDB20 |
Has owner | Ed Egan |
Has start date | |
Has deadline date | |
Has project status | |
Has project output | Data, Tool, How-to |
Copyright © 2019 edegan.com. All Rights Reserved. |
The VCDB20 project documents a build of my VCDB -- Venture Capital DataBase -- covering until the end of 2020. Each VCDB includes investments, funds, startups, executives, exits, locations, and more, derived from data from VentureXpert. This project updates vcdb4, which covered (almost) to the of Q3 2019, and replaces VCDB20H1 and [[VCDB20Q3], which were partial builds. See also: SDC Normalizer.
Data design
I followed the same data design as in VCDB20H1. Essentially the specification pulls everything, even things that aren't needed like incomplete M&As or withdrawn IPOs, or funds or investments that aren't venture capital, all the way to the present (the pull was done on 2020-01-04), and then place restrictions on the data later. Crucially, the pulls no longer use the venture-related flag, so the data contains private equity and other deals, as well as secondaries and purchases. Note that the M&As are pulled separately for public and private acquirers, and in chunks by year to keep the request sizes manageable. To update these scripts change:
- The end date
- The file locations
Processing Steps
Source Files
Copy over and update the source files:
Name ---- NormalizeFixedWidth.pl RoundOnOneLine.pl USFirmBranchOffices1980.rpt USFirmBranchOffices1980.ssh USFirms1980.rpt USFirms1980.ssh USFund1980.rpt USFund1980.ssh USFundExecs1980.rpt USFundExecs1980.ssh USIPO1980.rpt USIPO1980.ssh USMAPrivate.rpt USMAPrivate00-10.ssh USMAPrivate10-.ssh USMAPrivate80-85.ssh USMAPrivate85-00.ssh USMAPublic.rpt USMAPublic00-.ssh USMAPublic80-00.ssh USPortCo1980.rpt USPortCo1980.ssh USPortCoExecs1980.rpt USPortCoExecs1980.ssh USPortCoLongDesc1980.rpt USPortCoLongDesc1980.ssh USRound1980.rpt USRound1980.ssh USRoundOnOneLine1980.rpt USRoundOnOneLine1980.ssh
Update the paths and dates in the ssh files then run them (see SDC Platinum).
Database import
Run the SDC Normalizer on each of the files. For most of them, that's straightforward. You can safely ignore the Access Violation error messages that occur at the end of some pulls. However, the following require attention:
- Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
- Remove double quotes from USFund1980-normal.txt, USFundExecs1980-normal.txt, USPortCo1980-normal.txt
- The private and public M&A file sets have to be (separately) combined after they've been normalized. Then replace \tnp\t and \tnm\t with \t\t.
- For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header.
- The PortCo Long Description needs to be pre-processed from the command line and then post-processed in excel (see VCDB20H1 and Vcdb4#Long_Description).
Create the dbase as a researcher:
createdb vcdb20
Move the files to //mother/bulk/vcdb20 and run the load script:
E:\projects\vcdb20\Load.sql
Create the keys
Standardize company names using Hall normalization without a fuzzy algorithm (see The Matcher) and matching them to themselves for PortCos, M&As, IPOs. Then create keys for each table as follows:
- PortCo: Coname, statecode, datefirstinv
- M&A (private targets): targetname, statecode, announceddate
- IPOs: issuer, statecode, issuedate