Delineating Spatial Agglomerations
This page provides code and dataset development material for:
Egan, Edward J. and James A. Brander (2022), "New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups.", Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.
The dataset construction begins with startup data from Thomson Reuters’ VentureXpert. This data is retrieved using SDC Platinum and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. This data is combined with data on mergers and acquisitions from the Securities Data Commission M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events. See VCDB20.
Shapefiles from the 2020 U.S. Census TIGER/Line data series provide the boundaries and names of the MSAs, and a python script (Geocode.py) in conjunction with a Google Maps API, provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is approximately 10m of precision.
All of our data assembly, and much of our data processing and analysis, is done in a PostgreSQL PostGIS database. See our Research Computing Infrastructure page for more information.
However, we rely on python scripts to retrieve addresses from Google Maps, as well as compute the Hierarchical Cluster Analysis (HCA) itself, and estimate a cubic to determine the HCA-regression method agglomeration count for an MSA. We also use two Stata scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a Google Maps base layer.
Data Processing Steps
The script Agglomeration_CBSA.sql provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the CBSA boundaries. Startups in our data our keyed by a triple (coname, statecode, datefirstinv) as two different companies can have the same names in different states, or within the same state at two different times.
A python script, HCA.py, consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. This script builds upon:
The HCA.py script uses several functions from another python module, schedule.py, which encodes agglomeration schedules produced by the AgglomerativeClustering package. The standard encoding records the agglomeration schedule as complete paths, indicating which clusters are merged together at each step. The layer-cluster encoding provided in schedule.py instead efficiently records the agglomeration schedule as a series of layers. It also relies on only a single read of the source data, so it is fast.
The code snippets provided in hierarchy_InsertSnippets.py modify the standard library provided in the scipy.cluster.hierarchy package. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly-accurate PostGIS spatial functions in PostgreSQL. Furthermore, the code caches the results so, provided the distances fit into (high-speed) memory, it also allows users to increase the maximum feasible scale by around an order of magnitude. The code in hierarchy-InsertSnippets.py contains two snippets. The first snippet should be inserted at line 188 in the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734. A full copy of the amended hierarchy.py is also available.
The results of the HCA.py script are loaded back to the database, which produces a dataset for analysis in Stata. The script AgglomerationMaxR2.do loads this dataset and performs the HCA-Regressions. The results are passed to a python script, cubic.py, which selects the appropriate number of agglomerations for each MSA. The results from both AgglomerationMaxR2.do and Cubic.py are then loaded back into the database, which produces a final dataset and set of tables providing data for the maps. The analysis of the final dataset uses the Stata script AgglomerationAnalysis.do, and the maps are made using custom queries in QGIS.
Agglomeration_CBSA.sql is the main SQL file that performs the overall data processing.
AgglomerationAnalysis.do is the final Stata do file, which performs the analysis the builds the tables and regression specifications used in the paper.
AgglomerationMaxR2.do performs the HCA regressions to select agglomeration counts for each MSA.
Cubic.py solves the cubic estimation to select an agglomeration code for each MSA. It uses sklearn.linear_model.
Geocode.py uses the Google Maps Distance API to compute distances between locations.
HCA.py is the main Hierarchical Clustering Analysis script. It takes startup keys (coname, statecode, datefirstinv) and locations (latitude, longitude) for each CBSA-year as a record and returns (layer,cluster) for each record.
Hierarchy.py is my modified version of Damian Eads' hierarchy.py, which is available as a part of the scipy.cluster package.
Hierarchy-InsertSnippets.py contains just the snippets needed to modify hierarchy.py
Schedule.py provides various schedule manipulation methods and the layer-cluster encoding scheme that works with scipy.cluster.hierarchy.