4,872 bytes added, 13:46, 21 September 2020
{{Project
|Has project output=Data,Tool,How-to
|Has sponsor=Kauffman Incubator Project
|Has title=vcdb4
|Has owner=Ed Egan, and others.
}}
The build should be done as quickly but as cleanly as possible: it is needed right away, but it will also likely need to be updated in January 2020 to reflect 2019's year-end data.

===Notes===
There appears to be an error in the M&A data processing, which needs exploration. Compare:

 SELECT * FROM portcoexit WHERE coname ILIKE 'Augmentix%' OR coname ILIKE 'CheMatch%'
  OR coname ILIKE 'Hispanic Television%' OR coname ILIKE 'IDev T%' OR coname ILIKE 'Merrick%'
  OR coname ILIKE 'RosettaMed%' OR coname ILIKE 'SAT Corp%' OR coname ILIKE 'Visualase%';
 
 SELECT * FROM mas WHERE targetname ILIKE 'Augmentix%' OR targetname ILIKE 'CheMatch%'
  OR targetname ILIKE 'Hispanic Television%' OR targetname ILIKE 'IDev T%' OR targetname ILIKE 'Merrick%'
  OR targetname ILIKE 'RosettaMed%' OR targetname ILIKE 'SAT Corp%' OR targetname ILIKE 'Visualase%';
==SDC Platinum Requests==
The next phase is to standardize the names and match the portcos to the IPOs and M&As, then clean up all of the core tables. That takes us to the end of the old LoadingScriptV1.sql. We then need to do the equivalent of updating GeoLoad.sql. This time, we are going to use the old data, as before, but try updating any portco or firm that doesn't have at least 4dp (11.1m resolution), or perhaps 5dp (1.11m resolution), accuracy [http://wiki.gis.com/wiki/index.php/Decimal_degrees]. We have to fix the PortCo keys first, but we can only run 2,500 free queries through the Google geocoding API per key each day, so this should be a batch job.
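That batch job can be sketched as follows. This is a minimal Python sketch: the row layout and helper names (`needs_regeocoding`, `daily_batches`) are assumptions for illustration, and the 2,500 figure is the free-tier daily quota noted above.

```python
from decimal import Decimal

DAILY_QUOTA = 2500  # free Google geocoding queries per API key per day

def decimal_places(value):
    """Count decimal places in a coordinate string, e.g. '29.7174' -> 4."""
    exp = Decimal(value).as_tuple().exponent
    return max(0, -exp)

def needs_regeocoding(lat, lon, min_dp=4):
    """True if either coordinate has fewer than min_dp decimal places.
    4dp is roughly 11.1m resolution; 5dp is roughly 1.11m."""
    return decimal_places(lat) < min_dp or decimal_places(lon) < min_dp

def daily_batches(rows, quota=DAILY_QUOTA):
    """Split the rows that need re-geocoding into one batch per day,
    each under the daily quota."""
    todo = [r for r in rows if needs_regeocoding(r["lat"], r["lon"])]
    return [todo[i:i + quota] for i in range(0, len(todo), quota)]
```

Each day's batch would then be sent to the geocoder and written back before the next batch runs.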
 
==Enable PostGIS Extensions==
 
Before we can load up geometries and other PostGIS types, and do PostGIS operations, we need to:
 CREATE EXTENSION postgis;
===Updating GeoCoding===
 perl .\Matcher.pl -mode=2 -file1="IPODistinctIssuer.txt" -file2="IPODistinctIssuer.txt"
Then clean up the PortCo table by removing any duplicate firms with different datefirstinv records. Following this, construct portcokeysstd (to be used to restrict keys throughout) and companybasecore (49136). Match between, and review, the M&As and IPOs:
 perl .\Matcher.pl -mode=2 -file1="PortCoMatchInput.txt" -file2="MAMatchInput.txt"
*Of the remaining (463), take the min date announced.
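The min-date rule for the remaining multi-match M&As can be sketched as follows (Python; the record layout, keyed on a hypothetical portcokey field, is illustrative):

```python
from datetime import date

def dedupe_min_announced(records):
    """Collapse duplicate M&A matches to one row per portcokey,
    keeping the earliest announced date (the rule for the remaining 463)."""
    best = {}
    for rec in records:
        key = rec["portcokey"]
        if key not in best or rec["announceddate"] < best[key]["announceddate"]:
            best[key] = rec
    return list(best.values())
```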
Follow a similar procedure for the IPOs. There are 3134 IPOs, of which 2551 are all good (3 out of 3), 495 have bad dates or states, and 87 are left. For these 87, almost all are pairs (presumably the ones that aren't had a bad date) where there are two issue dates, and we should take the minimum issue date. The result will be (with portcokey unique and valid):
*MACleanNoDups (8344) - PortCoKey, conamestd, targetnamestd, M&A info...
*IPOCleanNoDups (2600) - PortCoKey, conamestd, issuernamestd, IPO info...

===Exits===
Next we need to determine precedence if a PortCo has both an IPO and an M&A record. There are 173 such records. This is done in the query '''ExitKeys'''. PortCoExit (49136) is then built. This version fixes an error from before, where announceddate rather than effectivedate was being used as the exitdate for M&As. Note that exitvaluem is left as null for non-exits. PortCoAliveDead still uses a 5-year interval, but the date truncation date is updated to 9/22/2019. exitvaluem is then updated to 0 for those deemed dead without an exit and left as null for non-exits. Exitdate and year are left as is; new fields (deaddate, deadyear) are created appropriately.

===Fund and Branch===
We have to key off of fundname for the fund table. Fundbase has 31931 records on load. Throwing out undisclosed funds leaves 31540, with 31479 distinct names (requiring disclosed firmnames too reduces this to 31404). The solution is to keep the duplicate with the largest numportcos, discarding Net Ventures, Cambridge Innovations - Unspecified Fund, and Spring Capital, as these are the only three with a tie and all are of no consequence (numportco=1). Likewise, we have to key off of firmname for all firm tables. It isn't that we lack other identifying information in firmbase (all records have a founding date), but firmname is the only foreign key we have in the fund table and the branch office table.
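The fundname dedup rule just described (keep the duplicate with the largest numportcos) can be sketched like this (Python; the dict layout is illustrative, not the build's actual schema):

```python
def dedupe_funds(rows):
    """Keep one row per fundname, preferring the largest numportcos.
    rows: iterable of dicts with 'fundname' and 'numportcos' keys."""
    best = {}
    for r in rows:
        cur = best.get(r["fundname"])
        if cur is None or r["numportcos"] > cur["numportcos"]:
            best[r["fundname"]] = r
    return list(best.values())
```

Ties (like the three funds named above, all with numportco=1) fall through to whichever row arrives first, so those are handled by explicit discards.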
===Additions===
====Population====
For now:
*The population table (see Population.sql)
*The statepop2017 table (from States2017PopulationEstimate.txt, with DC (2018), GU (2017), and PR (2017) added)
We might want to join in the [[American Community Survey (ACS) Data]] too.

====Tiger Places====
We will use the old Tiger Places, but it is worth exploring how to rebuild this from scratch. First, visit https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html. Then download each state's place file. Load them and merge them. See [[PostGIS_Installation#Bulk_Download_TIGER_Shapefiles]] for more information. It is also worth reading the [[Tiger Geocoder]] and [[Jeemin Sim (Work Log)]].

====CPI====
We loaded the old CPI file, updating it with real numbers for 2018 and an estimate for 2019 based on extrapolating the last four months from the first eight, and then computing an annual average. The inflator is normalized to 2017 dollars. Data was taken from: https://www.bls.gov/cpi/tables/supplemental-files/historical-cpi-u-201908.pdf

====Other things to do====
We also might need to add:
*Accelerator data? (See CohortPortCoMatch.sql for doing the portco matching) -- this might be in its own dbase
*GSP (see Adding GDP.sql)
*Industry, Firm type, Firm stage and other lookups (see MatchingEntreps.sql)
*Titlelookup (see MatchingEntreps.sql)
*PortCoSBIR and PortCoPatents (see MatchingEntreps.sql)

==Building Datasets==
We need to collect code from:
*Ranking.sql
*MatchingEntreps.sql, which builds:
**Women
**Some of matching VC-Entreps?
(Much might now be redundant.)
*RevisedDBaseCode.sql, which builds:
**Matching VC-Entreps
**Some base tables and the super tables: PortCoSuper, FirmSuper, etc.
*Agglomeration.sql (the latest version is in E:\projects\agglomeration), which builds:
**Agglomeration!
**Note: needs updating to reflect the latest decisions

===BuildBaseTables===
To this end, the base tables, which are used by multiple different projects, are built in BuildBaseTables.sql. Build notes:
*Common code from MatchingEntrepsV3.sql was moved to BuildBaseTables.sql
*Almost all of the women's code now appears to be in the new base
*The super tables still need building
*The code for the Matching VC-Entreps project is now almost entirely in RevisedDBaseCode.sql

==Agglomeration==
This build supports the [[Urban Start-up Agglomeration and Venture Capital Investment]] paper. It uses the SQL script AgglomerationVcdb4.sql, which is in E:\projects\agglomeration. The notes are here: [[Urban_Start-up_Agglomeration_and_Venture_Capital_Investment#vcdb4]]
Also, there is an issue with multiples not showing up as multiple matches, e.g.:
 ARCA BIOPHARMA Inc  ARCA biopharma Inc  Hall  ARCA BIOPHARMA Inc  ARCA BIOPHARMA Inc  CO  3/1/2006  ARCA biopharma Inc  ARCA biopharma Inc  CO  9/25/2008  1 1 1 3 0 1
 ARCA BIOPHARMA Inc  ARCA biopharma Inc  Hall  ARCA BIOPHARMA Inc  ARCA Biopharma Inc  CO  2/3/2003  ARCA biopharma Inc  ARCA biopharma Inc  CO  9/25/2008  1 1 1 3 0 1
This is because there aren't multiple variants - just multiples of the same variant. It happens when there are two records for one company with the same state but different datefirstinv dates. One solution is to build an exclusion list of portco keys (those with the coname, statecode, and min(datefirstinv)) and exclude those keys everywhere applicable. Instead, we went back to the original portco table and fixed it properly, so that companybasecore doesn't have this issue.

==Ranking==
A new Ranking.sql file was created in E:\projects\vcdb4 and run.
There are imperfect matches between cities in VentureXpert and places in Tiger. In the past, this hasn't mattered, as we'd considered such a small subset of cities (and had gotten lucky on coverage). Geocoding covers 97.2607% (47,760 out of 49,136) of companybasecore, 48,815 (99.35%) of which have a non-null city. This drops to 96.42% of firms with datefirstinv >= 2010. Using just geocoded firms would therefore cost us a small number of firms, but these may not be randomly distributed. What we need, then, is a complete lookup table...
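A complete lookup would hinge on a normalized join key between VentureXpert city names and Tiger place names. A minimal sketch (Python; the normalization rules and field names are assumptions for illustration, not the build's actual method):

```python
import re

def city_key(name, statecode):
    """Normalized join key for matching VentureXpert cities to Tiger places.
    Lowercases, strips punctuation, and drops common place-type suffixes."""
    s = re.sub(r"[^a-z0-9 ]", "", name.lower())
    s = re.sub(r"\b(city|town|village|cdp)\b", "", s)
    return (re.sub(r"\s+", " ", s).strip(), statecode.upper())

def build_lookup(tiger_places):
    """tiger_places: iterable of (placename, statecode, placeid) tuples."""
    return {city_key(name, st): pid for name, st, pid in tiger_places}
```

Any VentureXpert city whose key misses the lookup would go on a manual-review list, which is where the imperfect matches noted above would surface.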
