This page provides code and dataset development material for:
Egan, Edward J. and James A. Brander (2022), "A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups," Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.
 
<pdf>File:Egan_Brander_(2022)_-_A_New_Method_for_Identifying_and_Delineating_Spatial_Agglomerations.pdf</pdf>
== Overview ==
[[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]] The dataset construction begins with startup data from Thomson Reuters’ [[VentureXpert]]. This data is retrieved using [[SDC Platinum (Wiki)|SDC Platinum]] and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. It is combined with data on mergers and acquisitions from the Securities Data Company (SDC) M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events. See [[VCDB20]].
[https://www2.census.gov/geo/tiger/TIGER2020/CBSA/tl_2020_us_cbsa.zip Shapefiles from the 2020 U.S. Census TIGER/Line data series] provide the boundaries and names of the MSAs, and a Python script (Geocode.py), in conjunction with the [https://developers.google.com/maps/documentation/distance-matrix Google Maps API], provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which gives [http://wiki.gis.com/wiki/index.php/Decimal_degrees approximately 10m of precision].
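The geocoding step might be sketched as follows. This is an illustrative stand-in for Geocode.py (which is not reproduced here), assuming the standard Google Maps Geocoding endpoint and using only the Python standard library; the rounding helper implements the four-decimal-place restriction.

```python
import json
import urllib.parse
import urllib.request

def round_coords(lat, lng, places=4):
    """Limit coordinates to four decimal places (~10m of precision)."""
    return (round(lat, places), round(lng, places))

def geocode(address, api_key):
    """Hypothetical sketch of a Google Maps Geocoding API call."""
    url = ("https://maps.googleapis.com/maps/api/geocode/json?"
           + urllib.parse.urlencode({"address": address, "key": api_key}))
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["results"][0]
    loc = result["geometry"]["location"]
    return round_coords(loc["lat"], loc["lng"])
```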
All of our data assembly, and much of our data processing and analysis, is done in a [https://www.postgresql.org/ PostgreSQL] [https://postgis.net/ PostGIS] database. See our [[Research Computing Infrastructure]] page for more information.
However, we rely on [https://www.python.org/ Python] scripts to retrieve addresses from Google Maps, to compute the [https://en.wikipedia.org/wiki/Hierarchical_clustering Hierarchical Cluster Analysis (HCA)] itself, and to estimate a cubic that determines the HCA-regression method agglomeration count for an [https://en.wikipedia.org/wiki/Metropolitan_statistical_area MSA]. We also use two [https://www.stata.com/ Stata] scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a [https://maps.google.com Google Maps] base layer.
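As an illustration of the cubic-estimation step, a third-degree polynomial can be fit with numpy. The variable names and synthetic data below are assumptions for demonstration, not the paper's actual regressors or selection rule.

```python
import numpy as np

# Hypothetical data: an agglomeration count observed at successive HCA layers.
layers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
counts = 2 * layers**3 - 3 * layers + 7       # synthetic, generated from a known cubic

# Fit a cubic: coeffs holds [a, b, c, d] for a*x^3 + b*x^2 + c*x + d.
coeffs = np.polyfit(layers, counts, deg=3)
```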
== Data Processing Steps ==
[[File:AgglomerationProcess_v2.png|center|thumb|512px|Data Processing Steps]]
The script [[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the [https://en.wikipedia.org/wiki/Core-based_statistical_area CBSA] boundaries. Startups in our data are keyed by a triple (coname, statecode, datefirstinv), as two different companies can have the same name in different states, or within the same state at two different times.
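The triple key can be illustrated in Python; the company names, states, dates, and MSA values below are hypothetical.

```python
# Each startup is identified by (coname, statecode, datefirstinv):
startups = {
    ("Acme Robotics", "CA", "2005-03-01"): {"msa": "San Jose"},
    ("Acme Robotics", "TX", "2005-03-01"): {"msa": "Austin"},    # same name, different state
    ("Acme Robotics", "CA", "2012-06-15"): {"msa": "San Jose"},  # same name and state, later first investment
}
```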
A python script, [[:File:HCA_py.pdf|HCA.py]], consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. This script builds upon:
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html scipy.spatial.distance.squareform]
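A minimal sketch of the clustering step, assuming average linkage purely for illustration (HCA.py itself defines the actual parameters): a pre-computed, symmetric distance matrix is condensed with squareform and fed to scipy's linkage, and the resulting tree is cut into clusters.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical pairwise distances (metres) among four startups in one MSA-year:
# two near-downtown and two in a second, distant agglomeration.
dist = np.array([
    [0.0,    100.0,  5000.0, 5100.0],
    [100.0,  0.0,    4950.0, 5050.0],
    [5000.0, 4950.0, 0.0,    120.0],
    [5100.0, 5050.0, 120.0,  0.0],
])

condensed = squareform(dist)              # condensed form required by linkage()
Z = linkage(condensed, method="average")  # average linkage, for illustration only
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
```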
The HCA.py script uses several functions from another python module, [[:File:schedule_py.pdf|schedule.py]], which encodes agglomeration schedules produced by the AgglomerativeClustering package. The standard encoding [https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py records the agglomeration schedule as complete paths], indicating which clusters are merged together at each step. The layer-cluster encoding provided in schedule.py instead records the agglomeration schedule compactly as a series of layers. It also requires only a single pass over the source data, so it is fast.
The code snippets provided in [[:File:hierarchy-InsertSnippets_py.pdf|hierarchy_InsertSnippets.py]] modify the standard library provided in the [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy package]. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly accurate [https://postgis.net/docs/ST_Distance.html PostGIS spatial functions] in PostgreSQL. Furthermore, the code caches the results so that, provided the distances fit into (high-speed) memory, it also allows users to increase the maximum feasible scale by around an order of magnitude. The code in hierarchy_InsertSnippets.py contains two snippets. The first snippet should be inserted at line 188 of the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734. A full copy of the amended [[:File:hierarchy_py.pdf|hierarchy.py]] is also available.
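The caching idea can be sketched as follows; this is not the actual code in hierarchy_InsertSnippets.py, and the great-circle function below is a hypothetical stand-in for a PostGIS ST_Distance query.

```python
import math
from functools import lru_cache

def great_circle_m(p, q):
    """Stand-in for a PostGIS ST_Distance query: spherical great-circle distance in metres."""
    (lat1, lon1), (lat2, lon2) = p, q
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    cos_central = (math.sin(phi1) * math.sin(phi2)
                   + math.cos(phi1) * math.cos(phi2) * math.cos(dlam))
    return 6371000.0 * math.acos(min(1.0, cos_central))

@lru_cache(maxsize=None)
def cached_distance(p, q):
    # Order the pair so (a, b) and (b, a) compute the identical value,
    # keeping the in-memory cache consistent for a symmetric metric.
    a, b = sorted((p, q))
    return great_circle_m(a, b)
```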
