== Overview ==
[[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]] The dataset construction begins with startup data from Thomson Reuters’ [[VentureXpert]]. This data is retrieved using [[SDC Platinum (Wiki)|SDC Platinum]] and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. It is combined with data on mergers and acquisitions from the Securities Data Company (SDC) M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events. See [[VCDB2020]].
[https://www2.census.gov/geo/tiger/TIGER2020/CBSA/tl_2020_us_cbsa.zip Shapefiles from the 2020 U.S. Census TIGER/Line data series] provide the boundaries and names of the MSAs, and a Python script (Geocode.py), in conjunction with the [https://developers.google.com/maps/documentation/distance-matrix Google Maps API], provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is [http://wiki.gis.com/wiki/index.php/Decimal_degrees approximately 10m of precision].
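The snippet below is a minimal sketch of this geocoding step, assuming the standard Google Maps Geocoding API endpoint; it is an illustration rather than a copy of Geocode.py, and the API key is a placeholder.

<syntaxhighlight lang="python">
# Minimal sketch of the geocoding step; Geocode.py differs in detail.
# Assumes the standard Google Maps Geocoding API; the key is a placeholder.
import requests

API_KEY = "YOUR-GOOGLE-MAPS-API-KEY"  # placeholder, not a real key

def geocode(address):
    """Return (latitude, longitude) rounded to four decimal places (~10m)."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": API_KEY},
        timeout=10,
    )
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    # Restrict precision to four decimal places, as described above
    return round(loc["lat"], 4), round(loc["lng"], 4)

print(geocode("1600 Amphitheatre Parkway, Mountain View, CA"))
</syntaxhighlight>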
All of our data assembly, and much of our data processing and analysis, is done in a [https://www.postgresql.org/ PostgreSQL] database with the [https://postgis.net/ PostGIS] extension. See our [[Research Computing Infrastructure]] page for more information.
However, we rely on [https://www.python.org/ Python] scripts to retrieve addresses from Google Maps, to compute the [https://en.wikipedia.org/wiki/Hierarchical_clustering Hierarchical Cluster Analysis (HCA)] itself, and to estimate the cubic that determines the HCA-regression method's agglomeration count for each [https://en.wikipedia.org/wiki/Metropolitan_statistical_area MSA]. We also use two [https://www.stata.com/ Stata] scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a [https://maps.google.com Google Maps] base layer.
== Data Processing Steps ==
[[File:AgglomerationProcess_v2.png|center|thumb|512px|Data Processing Steps]]
The script [[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the [https://en.wikipedia.org/wiki/Core-based_statistical_area CBSA] boundaries. Startups in our data are keyed by a triple (coname, statecode, datefirstinv), as two different companies can have the same name in different states, or within the same state at two different times.
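The sketch below illustrates the startup-to-CBSA spatial join with a PostGIS query issued from Python; the table and column names (startups, cbsa_boundaries, geom) are assumptions for illustration, not our actual schema.

<syntaxhighlight lang="python">
# Illustrative sketch of the startup-to-CBSA spatial join; the table and
# column names are assumptions, not our actual schema.
import psycopg2

conn = psycopg2.connect("dbname=agglomeration")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT s.coname, s.statecode, s.datefirstinv, c.name
        FROM startups s
        JOIN cbsa_boundaries c
          ON ST_Contains(c.geom,
                         ST_SetSRID(ST_MakePoint(s.lon, s.lat), 4269))
    """)  # SRID 4269 (NAD83) matches the TIGER/Line shapefiles
    for row in cur.fetchall():
        print(row)  # (coname, statecode, datefirstinv, CBSA name)
</syntaxhighlight>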
A Python script, [[:File:HCA_py.pdf|HCA.py]], consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. This script builds upon the following functions; a minimal sketch of how they fit together appears after the list:
* [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html sklearn.cluster.AgglomerativeClustering]
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html scipy.cluster.hierarchy.dendrogram]
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html scipy.spatial.distance.squareform]
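The sketch below shows how these pieces can combine for a single MSA-year, assuming a precomputed distance matrix (in production, the distances come from PostGIS, as described below); it is an illustration rather than HCA.py itself.

<syntaxhighlight lang="python">
# Illustrative per-MSA-year clustering; HCA.py differs in detail.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist, squareform

# Hypothetical startup coordinates (latitude, longitude) for one MSA-year
points = np.array([[42.36, -71.06], [42.37, -71.05],
                   [42.34, -71.10], [42.50, -71.20]])

# Square matrix of pairwise distances; the production code instead supplies
# highly-accurate distances computed in PostGIS (see below)
dist_matrix = squareform(pdist(points))

# Use affinity="precomputed" instead of metric= on scikit-learn < 1.2
hca = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                              linkage="average").fit(dist_matrix)
print(hca.labels_)  # cluster number for each startup
</syntaxhighlight>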
The HCA.py script uses several functions from another Python module, [[:File:schedule_py.pdf|schedule.py]], which encodes agglomeration schedules produced by the AgglomerativeClustering package. The standard encoding [https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py records the agglomeration schedule as complete paths], indicating which clusters are merged together at each step. The layer-cluster encoding provided in schedule.py instead efficiently records the agglomeration schedule as a series of layers. It also relies on only a single read of the source data, so it is fast.
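The toy example below illustrates the layer idea using scipy's linkage and fcluster interface: each layer records every startup's cluster label when the tree is cut into k clusters. It is a conceptual sketch; schedule.py's actual encoding and its single-read implementation differ.

<syntaxhighlight lang="python">
# Toy illustration of a layer-by-layer encoding; schedule.py's actual
# encoding and its single-read implementation differ.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[42.36, -71.06], [42.37, -71.05],
                   [42.34, -71.10], [42.50, -71.20]])
Z = linkage(pdist(points), method="average")  # complete-path schedule

# One layer per cut of the tree: the cluster label of every startup
# when the hierarchy is cut into k clusters
for k in range(len(points), 0, -1):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"layer with {k} clusters:", labels)
</syntaxhighlight>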
The code snippets provided in [[:File:hierarchy-InsertSnippets_py.pdf|hierarchy_InsertSnippets.py]] modify the standard library provided in the [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy package]. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly-accurate [https://postgis.net/docs/ST_Distance.html PostGIS spatial functions] in PostgreSQL. Furthermore, the code caches the results so, provided the distances fit into (high-speed) memory, it also allows users to increase the maximum feasible scale by around an order of magnitude. The code in hierarchy_InsertSnippets.py contains two snippets. The first snippet should be inserted at line 188 of the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734. A full copy of the amended [[:File:hierarchy_py.pdf|hierarchy.py]] is also available.

The results of the HCA.py script are loaded back into the database, which produces a dataset for analysis in Stata. The script [[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] loads this dataset and performs the HCA-regressions. The results are passed to a Python script, [[:File:Cubic_py.pdf|cubic.py]], which selects the appropriate number of agglomerations for each MSA. The results from both [[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] and cubic.py are then loaded back into the database, which produces a final dataset and a set of tables providing data for the maps. The analysis of the final dataset uses the Stata script [[:File:AgglomerationAnalysis.do.pdf|AgglomerationAnalysis.do]], and the maps are made using custom queries in [http://www.qgis.org QGIS].
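The toy sketch below illustrates one plausible reading of the cubic-selection step: fit a cubic to the HCA-regression R-squared values as a function of the candidate agglomeration count and take the count at the cubic's inflection point. Both the R-squared series and the selection rule are assumptions for illustration; cubic.py's actual rule may differ.

<syntaxhighlight lang="python">
# Toy sketch of the cubic-selection step; the R-squared series below is
# synthetic, and cubic.py's actual selection rule may differ.
import numpy as np

counts = np.arange(1, 21)  # candidate agglomeration counts for one MSA
rng = np.random.default_rng(0)
r2 = 1 - np.exp(-0.25 * counts) + rng.normal(0, 0.01, counts.size)

a, b, c, d = np.polyfit(counts, r2, deg=3)  # R^2 ~ a*k^3 + b*k^2 + c*k + d
k_star = -b / (3 * a)  # inflection point: 6a*k + 2b = 0 (an assumed rule)
chosen = int(round(np.clip(k_star, counts.min(), counts.max())))
print("selected agglomeration count:", chosen)
</syntaxhighlight>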
== Code ==
