
This page provides code and dataset development material for:
Egan, Edward J. and James A. Brander (2022), "A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups," Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.
Please check <pdf>File:Egan_Brander_(2022)_-_A_New_Method_for_Identifying_and_Delineating_Spatial_Agglomerations.pdf</pdf>

== Overview ==

[[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]]

The dataset construction begins with startup data from Thomson Reuters' [[VentureXpert]]. This data is retrieved using [[SDC Platinum (Wiki)|SDC Platinum]] and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. It is combined with data on mergers and acquisitions from the Securities Data Commission M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events. See [[VCDB20]].

[https://www2.census.gov/geo/tiger/TIGER2020/CBSA/tl_2020_us_cbsa.zip Shapefiles from the 2020 U.S. Census TIGER/Line data series] provide the boundaries and names of the MSAs, and a python script (Geocode.py), in conjunction with a [https://developers.google.com/maps/documentation/distance-matrix Google Maps API], provides longitudes and latitudes for startups. We restrict the accuracy of Google's results to four decimal places, which is [http://wiki.gis.com/wiki/index.php/Decimal_degrees approximately 10m of precision].

All of our data assembly, and much of our data processing and analysis, is done in a [https://www.postgresql.org/ PostgreSQL] database with the [https://postgis.net/ PostGIS] extension. See our [[Research Computing Infrastructure]] page for more information. However, we rely on [https://www.python.org/ python] scripts to retrieve addresses from Google Maps, to compute the [https://en.wikipedia.org/wiki/Hierarchical_clustering Hierarchical Cluster Analysis (HCA)] itself, and to estimate a cubic that determines the HCA-regression method agglomeration count for each [https://en.wikipedia.org/wiki/Metropolitan_statistical_area MSA].
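The coordinate-rounding step can be sketched as follows. This is an illustration only, not the actual Geocode.py (which is linked below): the <code>parse_location</code> helper and the response shape, which mimics Google Maps API JSON, are assumptions.

```python
# Illustrative sketch only: parsing and rounding a Google-Maps-style geocode
# result to four decimal places (~10m of precision). The helper name and
# response shape are assumptions; the actual script is Geocode.py.

def parse_location(geocode_result):
    """Extract (lat, lng) from a geocode result, rounded to 4 decimal places."""
    loc = geocode_result["geometry"]["location"]
    return (round(loc["lat"], 4), round(loc["lng"], 4))

# Mocked API response for a single address:
mock_result = {"geometry": {"location": {"lat": 29.717299812, "lng": -95.402199934}}}
print(parse_location(mock_result))  # (29.7173, -95.4022)
```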
We also use two [https://www.stata.com/ Stata] scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a [https://maps.google.com Google Maps] base layer.

== Data Processing Steps ==

The script [[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the [https://en.wikipedia.org/wiki/Core-based_statistical_area CBSA] boundaries. Startups in our data are keyed by a triple (coname, statecode, datefirstinv), as two different companies can have the same name in different states, or within the same state at two different times.

[[File:AgglomerationProcess_v2.png|center|thumb|768px|Data Processing Steps]]

A python script, [[:File:HCA_py.pdf|HCA.py]], consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. This script builds upon:
* [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html sklearn.cluster.AgglomerativeClustering]
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html scipy.cluster.hierarchy.dendrogram]
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html scipy.spatial.distance.squareform]

The HCA.py script uses several functions from another python module, [[:File:schedule_py.pdf|schedule.py]], which encodes agglomeration schedules produced by the AgglomerativeClustering package.
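For illustration, a minimal agglomerative clustering call on toy coordinates, assuming scikit-learn and NumPy are installed. This is not HCA.py itself: the coordinates and the fixed two-cluster cut are made up, whereas the actual script walks the full agglomeration schedule and emits the layer-cluster encoding.

```python
# Toy sketch, not HCA.py: cluster four startup coordinates for one MSA-year
# with sklearn's AgglomerativeClustering. The coordinates and the fixed
# two-cluster cut are assumptions made for illustration.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([
    [29.7173, -95.4022],   # two startups close together...
    [29.7180, -95.4030],
    [29.7600, -95.3700],   # ...and two more a few km away
    [29.7610, -95.3695],
])

model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(points)
print(labels)  # the two nearby pairs land in two distinct clusters
```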
The standard encoding [https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py records the agglomeration schedule as complete paths], indicating which clusters are merged together at each step. The layer-cluster encoding provided in schedule.py instead efficiently records the agglomeration schedule as a series of layers. It also relies on only a single read of the source data, so it is fast.

The code snippets provided in [[:File:hierarchy-InsertSnippets_py.pdf|hierarchy-InsertSnippets.py]] modify the standard library provided in the [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy package]. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly-accurate [https://postgis.net/docs/ST_Distance.html PostGIS spatial functions] in PostgreSQL. Furthermore, the code caches the results so, provided the distances fit into (high-speed) memory, it also allows users to increase the maximum feasible scale by around an order of magnitude.

The code in hierarchy-InsertSnippets.py contains two snippets. The first snippet should be inserted at line 188 in the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734. A full copy of the amended [[:File:hierarchy_py.pdf|hierarchy.py]] is also available.

The results of the HCA.py script are loaded back into the database, which produces a dataset for analysis in Stata. The script [[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] loads this dataset and performs the HCA-Regressions. The results are passed to a python script, [[:File:Cubic_py.pdf|cubic.py]], which selects the appropriate number of agglomerations for each MSA.
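The cubic-selection idea can be sketched roughly as follows. This is an illustrative reconstruction, not cubic.py itself: the toy R-squared profile and the grid-search selection rule are assumptions made for the example.

```python
# Illustrative reconstruction, not cubic.py: fit a cubic in the candidate
# agglomeration count to an R-squared profile and take the count at the
# fitted maximum. The toy R^2 values and grid-search rule are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

counts = np.arange(1, 11)                      # candidate agglomeration counts
r2 = -0.002 * (counts - 6.0) ** 2 + 0.8        # toy R^2 profile peaking at 6

X = np.column_stack([counts, counts**2, counts**3])  # cubic design matrix
fit = LinearRegression().fit(X, r2)

grid = np.linspace(1, 10, 901)
pred = fit.predict(np.column_stack([grid, grid**2, grid**3]))
best = grid[np.argmax(pred)]                   # count maximizing fitted R^2
print(f"{best:.2f}")
```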
The results from both AgglomerationMaxR2.do and Cubic.py are then loaded back into the database, which produces a final dataset and a set of tables providing data for the maps. The analysis of the final dataset uses the Stata script [[:File:AgglomerationAnalysis.do.pdf|AgglomerationAnalysis.do]], and the maps are made using custom queries in [http://www.qgis.org QGIS].

== Code ==

=== Agglomeration_CBSA.sql ===

[[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] is the main SQL file that performs the overall data processing.

<pdf>File:Agglomeration_CBSA.sql.pdf</pdf>

=== AgglomerationAnalysis.do ===

[[:File:AgglomerationAnalysis.do.pdf|AgglomerationAnalysis.do]] is the final Stata do file, which performs the analysis that builds the tables and regression specifications used in the paper.

<pdf>File:AgglomerationAnalysis.do.pdf</pdf>

=== AgglomerationMaxR2.do ===

[[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] performs the HCA regressions to select agglomeration counts for each MSA.

<pdf>File:AgglomerationMaxR2.do.pdf</pdf>

=== Cubic.py ===

[[:File:Cubic_py.pdf|Cubic.py]] solves the cubic estimation to select an agglomeration count for each MSA. It uses [https://scikit-learn.org/stable/modules/linear_model.html sklearn.linear_model].

<pdf>File:Cubic_py.pdf</pdf>

=== Geocode.py ===

[[:File:Geocode_py.pdf|Geocode.py]] uses the Google Maps Distance API to compute distances between locations.

<pdf>File:Geocode_py.pdf</pdf>

=== HCA.py ===

[[:File:HCA_py.pdf|HCA.py]] is the main Hierarchical Cluster Analysis script. It takes startup keys (coname, statecode, datefirstinv) and locations (latitude, longitude) for each CBSA-year as a record and returns (layer, cluster) for each record.
<pdf>File:HCA_py.pdf</pdf>

=== Hierarchy.py ===

[[:File:Hierarchy_py.pdf|Hierarchy.py]] is my modified version of Damian Eads' hierarchy.py, which is available as a part of the [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster package].

<pdf>File:Hierarchy_py.pdf</pdf>

=== Hierarchy-InsertSnippets.py ===

[[:File:Hierarchy-InsertSnippets_py.pdf|Hierarchy-InsertSnippets.py]] contains just the snippets needed to modify [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html hierarchy.py].

<pdf>File:Hierarchy-InsertSnippets_py.pdf</pdf>

=== Schedule.py ===

[[:File:Schedule_py.pdf|Schedule.py]] provides various schedule manipulation methods and the layer-cluster encoding scheme that works with [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy].

<pdf>File:Schedule_py.pdf</pdf>
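As a point of reference for the hierarchy.py modifications, unmodified SciPy can already consume precomputed distances if they are supplied in condensed form. The sketch below (with made-up distance values standing in for PostGIS ST_Distance output) shows that idea, although it lacks the in-library caching and scale benefits that the snippets add.

```python
# Sketch using unmodified SciPy: linkage() accepts a precomputed condensed
# distance matrix, so externally computed distances can be clustered directly.
# The distance values below are made up, standing in for PostGIS ST_Distance
# output; hierarchy-InsertSnippets.py instead patches hierarchy.py itself.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Symmetric pairwise distances (metres) between four startups.
D = np.array([
    [0.0,      80.0, 4000.0, 4100.0],
    [80.0,      0.0, 3950.0, 4050.0],
    [4000.0, 3950.0,    0.0,  120.0],
    [4100.0, 4050.0,  120.0,    0.0],
])

# squareform() converts the square matrix to the condensed form linkage() expects.
Z = linkage(squareform(D), method="average")
print(Z.shape)  # (3, 4): n-1 merge steps, 4 fields each
```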
