[[VCDB24]] is the 2024 and final iteration of my [[VentureXpert]] based '''V'''enture '''C'''apital '''D'''ata'''B'''ase. Thomson-Reuters discontinued access to VentureXpert through [[SDC Platinum]] on December 31st, 2023 (see also: [[SDC Normalizer]]). This iteration contains data up until then. Each VCDB includes investments, funds, startups, executives, exits, locations, and more. The previous build was [[VCDB23]], but the best previous instructions are from [[VCDB20]] or the [[McNair Center]] build, which was called [[VentureXpert Data]].
== Processing Steps ==
cat Out6.txt | perl -pe 's/\s{2,}/ /g' > Out7.txt
cat Out7.txt | perl -pe 's/###/\t/g' > USPortCoLongDesc1980Cleaned.txt
===Geocoding===
Part of Load.sql requires we update the Geocoding - adding new long and lat for PortCos and firm offices that we haven't seen before.
The last time this was run was vcdb20. Accordingly:
* In vcdb20, export the portcogeo, firmgeo, and bogeo tables
* Import them as portcogeo_vcdb20, firmgeo_vcdb20, and bogeo_vcdb20
* Build portcogrowthneedsgeo, firmneedsgeo, firmboneedsgeo files for geocoding
* Log into [https://console.cloud.google.com/ Google Console] and set up an API key. Note that:
** Up to $200/month should be free
** $5.00 USD per 1000 lookups.
** 3,000 QPM max
* In E:/tools/Geocode run the script(s): Geocode.py for portcos and GeocodeOneKey.py for everything else.
** Strip the header line out of the input file(s)
** python Geocode.py portcogrowthneedsgeo-NoHeader.txt
* Get the latest Gazetteer file(s): https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html
* Check the coverage of portcogeo and create firmbogeoplus
===Exits===
Another part of Load.sql does the matching to IPOs and MAs and their precedence. Note that:
* Issuer and target names are matched against themselves using the [[Matcher.pl]] script in mode 2
* PortCo stdnames are matched to issuerstd and targetstd (separately) in mode 0
* The state and date matching requirements are in the load.sql.
* There seems to have been duplicate issue records in the IPO data for vcdb2020 (and perhaps earlier). Some of the duplicates are often identical, except that the date is a day apart.
* The IPO records also contain listings on junior and foreign exchanges, as well as some OTC - I left these in and flagged them.
===Industry===
The industry coding is in IndustryCodes.txt. Note that:
* The code map had to be updated. The Excel file is in projects/vcdb24 but I didn't have the original counts to hand.
* The codes are not unique (576 out of 585 are unique) at the industry subgroup 3 level.
* The codes (code, code20, code100) should be joined using indclass, indminorgroup, indsubgroup
* The codes are 1,2,4 but not 3dg hierarchical.
* 1dg codes are IT, LS, and Other.
* Note that code is the full 4dg industry identifier, where as code20 and code100 are name-based aggregates with at least 20 or 100 observations in them.
{| class="wikitable"
|-
! 2dg Code
! Minorcode
! No. of PortCos
|-
| 11
| Communications and Media
| 3930
|-
| 12
| Computer Hardware
| 3058
|-
| 13
| Computer Software and Services
| 21157
|-
| 14
| Internet Specific
| 14440
|-
| 15
| Semiconductors/Other Elect.
| 3145
|-
| 21
| Biotechnology
| 4251
|-
| 22
| Medical/Health
| 7138
|-
| 31
| Consumer Related
| 7459
|-
| 32
| Industrial/Energy
| 7028
|-
| 33
| Other Products
| 14246
|}
I also tried some keyword industry coding from both short and long descriptions. The source code is at the top of BuildBaseTables.sql. The results are in sheets in the IndustryCodes.xlsx file.
===BuildBaseTables.sql===
Build the PortCoGrowthGeoId table that codes the city-state to a geoid.
{| class="wikitable"
|- style="vertical-align:bottom;"
! origin
! count
! Method
|-
| style="vertical-align:bottom;" | 1
| style="vertical-align:bottom;" | 45,111
| style="color:#808080;" | Address is geocoded and in tiger place
|-
| style="vertical-align:middle; color:#808080;" | 2
| style="vertical-align:bottom;" | 270
| style="color:#808080;" | city, statecode matches to only 1 geoid, so use it
|-
| style="vertical-align:middle; color:#808080;" | 3
| style="vertical-align:bottom;" | 1,374
| style="color:#808080;" | city, statecode matches to multiple geoids, use the most popular
|-
| style="vertical-align:middle; color:#808080;" | 4
| style="vertical-align:bottom;" | 964
| style="color:#808080;" | 1:1 straight city<->place and statecode match with tiger
|-
| style="vertical-align:middle; color:#808080;" | 5
| style="vertical-align:bottom;" | 509
| style="color:#808080;" | Use zctaplaceinfo to lookup the best place choice for the zipcode
|- style="vertical-align:bottom;"
| style="vertical-align:middle; color:#808080;" | 6
| 636
| style="color:#808080;" | Unable to code
|-
| style="vertical-align:middle; color:#808080;" | 9
| style="vertical-align:bottom;" | 24
| style="color:#808080;" | Custom coded
|}
===StartupCities===
The original Startup Cities code is in E:\projects\BayesianStartupCities\V1\startupcities.sql. The new version is in e:\projects\BayesianStartupCities\StartupCitiesV2.sql.