[[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]] The dataset construction begins with startup data from Thomson Reuters’ [[VentureXpert]]. This data is retrieved using [[SDC Platinum (Wiki)|SDC Platinum]] and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. It is combined with data on mergers and acquisitions from the Securities Data Company (SDC) M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events. See [[VCDB2020]].
[[https://www2.census.gov/geo/tiger/TIGER2020/CBSA/tl_2020_us_cbsa.zip Shapefiles from the 2020 U.S. Census TIGER/Line data series]] provide the boundaries and names of the MSAs, and a Python script (Geocode.py), in conjunction with the [[https://developers.google.com/maps/documentation/distance-matrix Google Maps API]], provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is [[http://wiki.gis.com/wiki/index.php/Decimal_degrees approximately 10m of precision]].
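The truncation to four decimal places amounts to a simple rounding of each coordinate pair. A minimal sketch (the helper name <code>round_coords</code> is ours for illustration, not a function from Geocode.py):

```python
def round_coords(lat, lon, places=4):
    """Round a latitude/longitude pair to a fixed number of decimal places.

    Four decimal places corresponds to roughly 10m of ground precision,
    matching the accuracy we keep from Google's geocoder results.
    """
    return round(lat, places), round(lon, places)

# Example: a raw geocoder result truncated to four decimal places.
print(round_coords(42.3601234567, -71.0589234567))
```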
All of our data assembly, and much of our data processing and analysis, takes place in a [[https://www.postgresql.org/ PostgreSQL]] database with the [[https://postgis.net/ PostGIS]] extension. See our [[Research Computing Infrastructure]] page for more information.
However, we rely on [[https://www.python.org/ Python]] scripts to retrieve addresses from Google Maps, to compute the [[https://en.wikipedia.org/wiki/Hierarchical_clustering Hierarchical Cluster Analysis (HCA)]] itself, and to estimate the cubic that determines the HCA-regression method's agglomeration count for each [[https://en.wikipedia.org/wiki/Metropolitan_statistical_area MSA]]. We also use two [[https://www.stata.com/ Stata]] scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images from queries to our database; these images use a [[https://maps.google.com Google Maps]] base layer.
== Data Processing Steps ==
A Python script, [[:File:HCA_py.pdf|HCA.py]], consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. This script builds upon:
* [[https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html sklearn.cluster.AgglomerativeClustering]]
* [[https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html scipy.cluster.hierarchy.dendrogram]]
* [[https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html scipy.spatial.distance.squareform]]
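These pieces fit together roughly as follows. This is a minimal sketch of hierarchical clustering on a precomputed distance matrix using the scipy routines above (plus <code>fcluster</code> from the same module), not the actual HCA.py; the four-startup distance matrix is made up:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Symmetric pairwise distance matrix (km) for four hypothetical startups:
# two near each other downtown, two near each other in a suburb.
D = np.array([
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.5],
    [6.0, 5.0, 0.0, 1.5],
    [7.0, 6.5, 1.5, 0.0],
])

# squareform converts the square matrix to the condensed form linkage expects.
Z = linkage(squareform(D), method="average")

# Cut the tree into two clusters: the two nearby pairs separate cleanly.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```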
The HCA.py script uses several functions from another Python module, [[:File:schedule_py.pdf|schedule.py]], which encodes agglomeration schedules produced by the AgglomerativeClustering package. The standard encoding [[https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py records the agglomeration schedule as complete paths]], indicating which clusters are merged at each step. The layer-cluster encoding provided in schedule.py instead records the agglomeration schedule compactly as a series of layers. It also requires only a single pass over the source data, so it is fast.
The code snippets provided in [[:File:hierarchy-InsertSnippets_py.pdf|hierarchy-InsertSnippets.py]] modify the standard library provided in the [[https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy package]]. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly accurate [[https://postgis.net/docs/ST_Distance.html PostGIS spatial functions]] in PostgreSQL. Furthermore, the code caches the results, so, provided the distances fit into (high-speed) memory, it also increases the maximum feasible scale by around an order of magnitude. The file contains two snippets. The first snippet should be inserted at line 188 of the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734. A full copy of the amended [[:File:hierarchy_py.pdf|hierarchy.py]] is also available.
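The effect of the snippets, bypassing the internal pdist call in favor of precomputed distances, can be approximated without patching scipy at all, because linkage already accepts a precomputed condensed distance vector. A sketch, substituting a haversine great-circle distance for the PostGIS ST_Distance call (the <code>haversine_km</code> helper and sample coordinates are ours):

```python
import math
import numpy as np
from scipy.cluster.hierarchy import linkage

def haversine_km(p, q):
    """Great-circle distance in km; a rough stand-in for PostGIS ST_Distance."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Two points near each other in Boston, one in New York.
points = [(42.3601, -71.0589), (42.3611, -71.0599), (40.7128, -74.0060)]

# Precompute the condensed (upper-triangle) distance vector once and reuse
# it; this precompute-and-cache step is what the snippets add to hierarchy.py.
condensed = np.array([haversine_km(points[i], points[j])
                      for i in range(len(points))
                      for j in range(i + 1, len(points))])
Z = linkage(condensed, method="average")
```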
The results of the HCA.py script are loaded back into the database, which produces a dataset for analysis in Stata. The script [[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] loads this dataset and performs the HCA-regressions. The results are passed to a Python script, [[:File:Cubic_py.pdf|Cubic.py]], which selects the appropriate number of agglomerations for each MSA. The results from both AgglomerationMaxR2.do and Cubic.py are then loaded back into the database, which produces the final dataset, as well as a set of tables providing data for the maps. The analysis of the final dataset uses the Stata script [[:File:AgglomerationAnalysis.do.pdf|AgglomerationAnalysis.do]], and the maps are made using custom queries in [[http://www.qgis.org QGIS]].
== Code ==
=== Cubic.py ===
[[:File:Cubic_py.pdf|Cubic.py]] solves the cubic estimation that selects the agglomeration count for each MSA. It uses [[https://scikit-learn.org/stable/modules/linear_model.html sklearn.linear_model]].
<pdf>File:Cubic_py.pdf</pdf>
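For illustration, a cubic of this kind can be fit with sklearn.linear_model as follows. The R² values below are made up, and no selection rule is shown; this is a sketch of the estimation step only, not Cubic.py itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: R-squared from the HCA-regressions at each candidate
# agglomeration count for one MSA (values invented for this example).
counts = np.arange(1, 11)
r2 = np.array([0.10, 0.28, 0.41, 0.50, 0.55, 0.575, 0.585, 0.59, 0.592, 0.593])

# Cubic design matrix: count, count^2, count^3 (the intercept is fit
# by the model itself).
X = np.column_stack([counts, counts**2, counts**3])
model = LinearRegression().fit(X, r2)

# Fitted cubic values at each candidate count.
fitted = model.predict(X)
print(model.intercept_, model.coef_)
```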
=== Hierarchy.py ===
[[:File:Hierarchy_py.pdf|Hierarchy.py]] is my modified version of Damian Eads' hierarchy.py, which is available as a part of the [[https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster package]].
<pdf>File:Hierarchy_py.pdf</pdf>
=== Hierarchy-InsertSnippets.py ===
[[:File:Hierarchy-InsertSnippets_py.pdf|Hierarchy-InsertSnippets.py]] contains just the snippets needed to modify [[https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html hierarchy.py]].
<pdf>File:Hierarchy-InsertSnippets_py.pdf</pdf>
=== Schedule.py ===
[[:File:Schedule_py.pdf|Schedule.py]] provides various schedule manipulation methods and the layer-cluster encoding scheme that works with [[https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy]].
<pdf>File:Schedule_py.pdf</pdf>
