[[VCDB24]] is the 2024 and final iteration of my [[VentureXpert]]-based '''V'''enture '''C'''apital '''D'''ata'''B'''ase. Thomson-Reuters discontinued access to VentureXpert through [[SDC Platinum]] on December 31st, 2023 (see also: [[SDC Normalizer]]). This iteration contains data up until then. Each VCDB includes investments, funds, startups, executives, exits, locations, and more. The previous build was [[VCDB23]], but the best previous instructions are from [[VCDB20]] or the [[McNair Center]] build, which was called [[VentureXpert Data]].
== Processing Steps ==
Get the source data:
# Copy over the rpt, ssh, and pl files into E:\projects\vcdb24\SDC, and bulk edit the ssh files:
## Change the final date (12/31/2022, and in one file 07/20/2020) to 12/31/2023, and change vcdb23 to vcdb24.
# Run the ssh files against SDC Platinum. Note that SDC Platinum's service will be withdrawn one last time on December 31st, 2023.
# Run the [[SDC Normalizer]] script (one of the pl files) on each output
## Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
## The private and public M&A file sets have to be separately combined into two files after they have been normalized. Then replace \tnp\t and \tnm\t with \t\t in each.
## For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first, then RoundOnOneLine.pl, and then fix the header.
## PortCoLongDescription must be pre-processed from the command line and then post-processed in Excel (see below, as well as [[VCDB20H1]] and [[Vcdb4#Long_Description]]). However, I didn't load it for this run.

Create the postgres database:
# Create a new database on mother (createdb vcdb24) and set up a directory for the input files: E:\projects\vcdb24
# Copy Load.sql over (to the sql folder) and edit it. Run it section-by-section; a sketch of a typical section is given below.
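Load.sql itself is not reproduced on this page. As a minimal sketch of the pattern its sections follow, assuming a hypothetical firms extract (the real table and column definitions come from the rpt/ssh report specifications):

 -- Hedged sketch of one Load.sql section; the table and column names
 -- here are illustrative, not the real schema.
 DROP TABLE IF EXISTS firms;
 CREATE TABLE firms (
     firmname  varchar(255),
     city      varchar(100),
     statecode varchar(2),
     fundcount int
 );
 -- Tab-delimited load, treating empty fields as NULL; assumes the
 -- header row has already been stripped from the normalized file.
 \copy firms FROM 'E:/projects/vcdb24/firms.txt' WITH (DELIMITER E'\t', NULL '')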
===PortCoLongDescription===

Process the Long Description data as follows:
# Remove the header and footer, and then save as Process.txt using UNIX line endings and UTF-8 encoding.
# Run the first section (producing Out5.txt) of the regex process below.
# Import into Excel to make it tab-delimited.
# Remove double quotes (") from just the description field.
# Put in a new header.
# Save as In5.txt with UNIX/UTF-8.
# Run the last regexes. They deal with the spaces in the description and the cases where there is no description.
# Try importing USVCPortCoLongDesc1980Cleaned.txt. It should be fine.

 cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
 cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
 cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
 cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
 cat Out4.txt | perl -pe 's/(\d{4}) $/\1\t/g' > Out5.txt
 ...
 cat In5.txt | perl -pe 's/(\d{4})\t$/\1###/g' > Out6.txt
 cat Out6.txt | perl -pe 's/\s{2,}/ /g' > Out7.txt
 cat Out7.txt | perl -pe 's/###/\t/g' > USPortCoLongDesc1980Cleaned.txt

===Geocoding===

Part of Load.sql requires that we update the geocoding, adding new longitude and latitude for portcos and firm offices that we haven't seen before. The last time this was run was vcdb20. Accordingly:
* In vcdb20, export the portcogeo, firmgeo, and bogeo tables
* Import them as portcogeo_vcdb20, firmgeo_vcdb20, and bogeo_vcdb20
* Build the portcogrowthneedsgeo, firmneedsgeo, and firmboneedsgeo files for geocoding (a sketch is given after this list)
* Log into [https://console.cloud.google.com/ Google Console] and set up an API key. Note that:
** Up to $200/month should be free
** $5.00 USD per 1,000 lookups
** 3,000 QPM max
* In E:/tools/Geocode run the script(s): Geocode.py for portcos and GeocodeOneKey.py for everything else.
** Strip the header line out of the input file(s)
** python Geocode.py portcogrowthneedsgeo-NoHeader.txt
* Get the latest Gazetteer file(s): https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html
* Check the coverage of portcogeo and create firmbogeoplus
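The exact query for building the needs-geocoding files isn't recorded here. A minimal sketch for portcogrowthneedsgeo, assuming the portco addresses live in a portco table keyed by stdname (the address column names are assumptions), anti-joins against the imported vcdb20 coordinates:

 -- Hedged sketch: portcos whose addresses got no coordinates from vcdb20.
 -- portcogeo_vcdb20 is the table imported from the vcdb20 export; the
 -- portco table and its address columns are assumptions.
 DROP TABLE IF EXISTS portcogrowthneedsgeo;
 CREATE TABLE portcogrowthneedsgeo AS
 SELECT p.stdname, p.address, p.city, p.statecode, p.zip
 FROM portco AS p
 LEFT JOIN portcogeo_vcdb20 AS g ON g.stdname = p.stdname
 WHERE g.stdname IS NULL;
 -- Export tab-delimited for Geocode.py (strip the header before running it):
 \copy portcogrowthneedsgeo TO 'E:/projects/vcdb24/portcogrowthneedsgeo.txt' WITH (FORMAT csv, DELIMITER E'\t', HEADER true)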
===Exits===

Another part of Load.sql does the matching to IPOs and M&As and their precedence. Note that:
* Issuer and target names are matched against themselves using the [[Matcher.pl]] script in mode 2
* PortCo stdnames are matched to issuerstd and targetstd (separately) in mode 0
* The state and date matching requirements are in Load.sql (a sketch of the shape of the match is given after this list)
* There seem to have been duplicate issue records in the IPO data for vcdb20 (and perhaps earlier). Some duplicates are identical except that their dates are a day apart.
* The IPO records also contain listings on junior and foreign exchanges, as well as some OTC; I left these in and flagged them.
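The precise state and date requirements live in Load.sql; what follows is only a sketch of the shape of the IPO-side match, assuming the mode 0 [[Matcher.pl]] output has been loaded into a (hypothetical) portcoipomatch table, and that the portco and ipo tables carry the columns named below:

 -- Hedged sketch: join portcos to IPO issues via the matched standard
 -- names, then enforce state and date sanity requirements.
 -- portcoipomatch, statecode, issuedate, and firstinvestmentdate are
 -- assumptions; stdname and issuerstd are named on this page.
 SELECT p.stdname, i.issuerstd, i.issuedate
 FROM portco AS p
 JOIN portcoipomatch AS m ON m.stdname = p.stdname
 JOIN ipo AS i ON i.issuerstd = m.issuerstd
 WHERE i.statecode = p.statecode             -- same state
   AND i.issuedate >= p.firstinvestmentdate; -- the IPO can't predate the first round

The target-side (M&A) match would be analogous, using targetstd.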
===Industry===

The industry coding is in IndustryCodes.txt. Note that:
* The code map had to be updated. The Excel file is in projects/vcdb24, but I didn't have the original counts to hand.
* The codes are not unique at the industry subgroup 3 level (576 out of 585 are unique).
* The codes (code, code20, code100) should be joined using indclass, indminorgroup, and indsubgroup.
* The codes are hierarchical at the 1dg, 2dg, and 4dg levels, but not at 3dg.
* 1dg codes are IT, LS, and Other.
* Note that code is the full 4dg industry identifier, whereas code20 and code100 are name-based aggregates with at least 20 or 100 observations in them.

{| class="wikitable"
|-
! 2dg Code
! Minor Group
! No. of PortCos
|-
| 11
| Communications and Media
| 3930
|-
| 12
| Computer Hardware
| 3058
|-
| 13
| Computer Software and Services
| 21157
|-
| 14
| Internet Specific
| 14440
|-
| 15
| Semiconductors/Other Elect.
| 3145
|-
| 21
| Biotechnology
| 4251
|-
| 22
| Medical/Health
| 7138
|-
| 31
| Consumer Related
| 7459
|-
| 32
| Industrial/Energy
| 7028
|-
| 33
| Other Products
| 14246
|}

I also tried some keyword industry coding from both the short and long descriptions. The source code is at the top of BuildBaseTables.sql. The results are in sheets in the IndustryCodes.xlsx file.

===BuildBaseTables.sql===

Build the PortCoGrowthGeoId table, which codes the city-state to a geoid (a sketch of the origin precedence is at the end of this page):

{| class="wikitable"
|-
! origin
! count
! Method
|-
| 1
| 45,111
| Address is geocoded and in tiger place
|-
| 2
| 270
| city, statecode matches to only 1 geoid, so use it
|-
| 3
| 1,374
| city, statecode matches to multiple geoids, use the most popular
|-
| 4
| 964
| 1:1 straight city<->place and statecode match with tiger
|-
| 5
| 509
| Use zctaplaceinfo to look up the best place choice for the zipcode
|-
| 6
| 636
| Unable to code
|-
| 9
| 24
| Custom coded
|}

===StartupCities===

The original Startup Cities code is in E:\projects\BayesianStartupCities\V1\startupcities.sql. The new version is in E:\projects\BayesianStartupCities\StartupCitiesV2.sql.
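Returning to the BuildBaseTables.sql section above: the page doesn't show the implementation, but the origin column implies a precedence cascade across the coding methods. A minimal sketch, in which each method table is a hypothetical (stdname, geoid) stand-in for one row of the origin table (origin 9, custom coding, would be patched in by hand afterwards):

 -- Hedged sketch of the origin precedence for PortCoGrowthGeoId.
 -- Every methodN table below is a hypothetical lookup corresponding to
 -- one coding method from the origin table above.
 SELECT p.stdname,
        COALESCE(m1.geoid, m2.geoid, m3.geoid, m4.geoid, m5.geoid) AS geoid,
        CASE WHEN m1.geoid IS NOT NULL THEN 1  -- geocoded address in a tiger place
             WHEN m2.geoid IS NOT NULL THEN 2  -- unique city-statecode match
             WHEN m3.geoid IS NOT NULL THEN 3  -- most popular of multiple matches
             WHEN m4.geoid IS NOT NULL THEN 4  -- 1:1 city<->place match with tiger
             WHEN m5.geoid IS NOT NULL THEN 5  -- best place for the zipcode
             ELSE 6                            -- unable to code
        END AS origin
 FROM portco AS p
 LEFT JOIN method1 AS m1 ON m1.stdname = p.stdname
 LEFT JOIN method2 AS m2 ON m2.stdname = p.stdname
 LEFT JOIN method3 AS m3 ON m3.stdname = p.stdname
 LEFT JOIN method4 AS m4 ON m4.stdname = p.stdname
 LEFT JOIN method5 AS m5 ON m5.stdname = p.stdname;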