|Has title||Restoring vcdb3|
|Has owner||Ed Egan|
|Has start date|
|Has deadline date|
|Has project status||Complete|
|Has project output||Data|
|Copyright © 2019 edegan.com. All Rights Reserved.|
The VentureXpert Data build notes.
Restore the dbase
Run the following:
cd /bulk/mcnair/backups su researcher createdb vcdb3 pg_restore -Fc -d vcdb3 vcdb3_Fc_230818.dump cd /bulk mkdir vcdb3 cd vcdb3 psql vcdb3
Reload the last changes
- make vcdb3 in E:\projects
- copy C:\Users\ed\Dropbox\coauthoredprojects\NonMcNair\MatchingVCs\RevisedDbaseCode.sql to e:\projects\vcdb3
- Run that code!
1.6m queries running in <7s index on 1.6m in <1s Big table (99m) used to take ~2mins, now takes 2:15. Might want to reoptimize config. Big reduction used to take 12mins. Now takes 55 secs! In later queries (not as big), allocated cpu 97-100% but mem only 3.2% Copy up to 3.6% mem usage. Max observed 4.3%. All of the code ran correctly, with the right counts, etc.
Code was in:
The following files were copied to E:\projects\vcdb3\OriginalSQL
- Adding GDP.sql
- Agglomeration.SQL - this is the agglomeration work
- LoadingScriptsV1.sql - this does the load!
- MatchingEntrepsV3.sql - this is the main build, after the load
- RoundRanking.sql - this is the city ranking script
- RoundRankingStates.sql - this is the state ranking script
Note that all code, including ranking code was run before backup.
Fix the Geocoding
The longitudes and latitudes were stored as numeric, rather than real, but this doesn't seem to the source of the problem as it has suitable precision. Instead the problem seems to be in the loading and unloading of tables through the \COPY command.
Regardless, we have decimal degrees to six decimal places from Google Maps, which is equivalent to a staggering 11cm of accuracy at the equator. See http://wiki.gis.com/wiki/index.php/Decimal_degrees.
There are some notes on VentureXpert_Data#GeoCoding_Companies.2C_Firms.2C_and_Branch_Offices and the SQL appears to be in:
The company process began with a load of oldgeocords 44740 from companybasegeomaster.txt (which came from vcdb2). The problem is that companybasegeomaster has low precision data in it. This was documented in VC_Database_Rebuild#Re-Fixing_erroneous_geo-coordinates. There's good data in vcdb2:
- Geocore 43628 (with nulls), keyed by coname (colevelsimple)
- Geoallcoords 44999 (with nulls), keyed by coname, datefirst, statecode
- Geoallcoords1 44999 (with nulls), keyed by coname, datefirst, statecode, with exclude flag
The next step in the company geo process was to load remaining from RemainingLatLong.txt. This is high precision data.
The fix to the geo data is at the top of:
Essentially this fix:
- DROPs portcogeo; oldgeocords; geoallcoords; geoallcoords1; goodgeoold; geoallcoords2; geoallcoords3; goodgeonew; geocodesportco;
- Load geoallcoords1fromvcdb2 (which comes from vcdb2!)
- Merge the results of geoallcoords1fromvcdb2 with the resutls of remaining
- Mark the out-of-bounds exclusions
- Produce portcogeo 47715
For the firms, there are three sources, all of which are high precision:
- bogeo 2046 from BranchOfficesGeo.txt
- oldfirmcoords 5556 from FirmCoords.txt
- firmremainingcoords 706 from FirmRemainingCoords.txt
- bogeo 2046
- firmgeocoords 6049
- firmbasecore 15437, which is large because it contains non-US