== Overview ==
[[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]] The dataset construction begins with startup data from Thomson Reuters’ [[VentureXpert]]. This data is retrieved using [[SDC Platinum (Wiki)|SDC Platinum]] and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. It is combined with data on mergers and acquisitions from the Securities Data Company (SDC) M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events. See [[VCDB2020]].
[https://www2.census.gov/geo/tiger/TIGER2020/CBSA/tl_2020_us_cbsa.zip Shapefiles from the 2020 U.S. Census TIGER/Line data series] provide the boundaries and names of the MSAs, and a Python script (Geocode.py), in conjunction with the [https://developers.google.com/maps/documentation/distance-matrix Google Maps API], provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is [http://wiki.gis.com/wiki/index.php/Decimal_degrees approximately 10m of precision].
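The snippet below is a minimal sketch of this geocoding step, assuming the standard Google Maps Geocoding API endpoint; it is an illustration rather than a copy of Geocode.py, and the API key is a placeholder.

<syntaxhighlight lang="python">
# Minimal sketch of the geocoding step; Geocode.py differs in detail.
# Assumes the standard Google Maps Geocoding API; the key is a placeholder.
import requests

API_KEY = "YOUR-GOOGLE-MAPS-API-KEY"  # placeholder, not a real key

def geocode(address):
    """Return (latitude, longitude) rounded to four decimal places (~10m)."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": API_KEY},
        timeout=10,
    )
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    # Restrict precision to four decimal places, as described above
    return round(loc["lat"], 4), round(loc["lng"], 4)

print(geocode("1600 Amphitheatre Parkway, Mountain View, CA"))
</syntaxhighlight>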
All of our data assembly, and much of our data processing and analysis, is done in a [https://www.postgresql.org/ PostgreSQL] database with the [https://postgis.net/ PostGIS] extension. See our [[Research Computing Infrastructure]] page for more information.
However, we rely on [https://www.python.org/ Python] scripts to retrieve addresses from Google Maps, to compute the [https://en.wikipedia.org/wiki/Hierarchical_clustering Hierarchical Cluster Analysis (HCA)] itself, and to estimate the cubic that determines the HCA-regression method's agglomeration count for each [https://en.wikipedia.org/wiki/Metropolitan_statistical_area MSA]. We also use two [https://www.stata.com/ Stata] scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a [https://maps.google.com Google Maps] base layer.
== Data Processing Steps ==
[[File:AgglomerationProcess_v2.png|center|thumb|512px|Data Processing Steps]]
The script [[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the [https://en.wikipedia.org/wiki/Core-based_statistical_area CBSA] boundaries. Startups in our data are keyed by a triple (coname, statecode, datefirstinv), as two different companies can have the same name in different states, or within the same state at two different times.
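The sketch below illustrates the startup-to-CBSA spatial join with a PostGIS query issued from Python; the table and column names (startups, cbsa_boundaries, geom) are assumptions for illustration, not our actual schema.

<syntaxhighlight lang="python">
# Illustrative sketch of the startup-to-CBSA spatial join; the table and
# column names are assumptions, not our actual schema.
import psycopg2

conn = psycopg2.connect("dbname=agglomeration")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT s.coname, s.statecode, s.datefirstinv, c.name
        FROM startups s
        JOIN cbsa_boundaries c
          ON ST_Contains(c.geom,
                         ST_SetSRID(ST_MakePoint(s.lon, s.lat), 4269))
    """)  # SRID 4269 (NAD83) matches the TIGER/Line shapefiles
    for row in cur.fetchall():
        print(row)  # (coname, statecode, datefirstinv, CBSA name)
</syntaxhighlight>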
A Python script, [[:File:HCA_py.pdf|HCA.py]], consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. This script builds upon the following functions; a minimal sketch of how they fit together appears after the list:
* [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html sklearn.cluster.AgglomerativeClustering]
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html scipy.cluster.hierarchy.dendrogram]
* [https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html scipy.spatial.distance.squareform]
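The sketch below shows how these pieces can combine for a single MSA-year, assuming a precomputed distance matrix (in production, the distances come from PostGIS, as described below); it is an illustration rather than HCA.py itself.

<syntaxhighlight lang="python">
# Illustrative per-MSA-year clustering; HCA.py differs in detail.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist, squareform

# Hypothetical startup coordinates (latitude, longitude) for one MSA-year
points = np.array([[42.36, -71.06], [42.37, -71.05],
                   [42.34, -71.10], [42.50, -71.20]])

# Square matrix of pairwise distances; the production code instead supplies
# highly-accurate distances computed in PostGIS (see below)
dist_matrix = squareform(pdist(points))

# Use affinity="precomputed" instead of metric= on scikit-learn < 1.2
hca = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                              linkage="average").fit(dist_matrix)
print(hca.labels_)  # cluster number for each startup
</syntaxhighlight>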
The HCA.py script uses several functions from another Python module, [[:File:schedule_py.pdf|schedule.py]], which encodes agglomeration schedules produced by the AgglomerativeClustering package. The standard encoding [https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py records the agglomeration schedule as complete paths], indicating which clusters are merged together at each step. The layer-cluster encoding provided in schedule.py instead efficiently records the agglomeration schedule as a series of layers. It also relies on only a single read of the source data, so it is fast.
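The toy example below illustrates the layer idea using scipy's linkage and fcluster interface: each layer records every startup's cluster label when the tree is cut into k clusters. It is a conceptual sketch; schedule.py's actual encoding and its single-read implementation differ.

<syntaxhighlight lang="python">
# Toy illustration of a layer-by-layer encoding; schedule.py's actual
# encoding and its single-read implementation differ.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[42.36, -71.06], [42.37, -71.05],
                   [42.34, -71.10], [42.50, -71.20]])
Z = linkage(pdist(points), method="average")  # complete-path schedule

# One layer per cut of the tree: the cluster label of every startup
# when the hierarchy is cut into k clusters
for k in range(len(points), 0, -1):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"layer with {k} clusters:", labels)
</syntaxhighlight>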
The code snippets provided in [[:File:hierarchy-InsertSnippets_py.pdf|hierarchy_InsertSnippets.py]] modify the standard library provided in the [https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html scipy.cluster.hierarchy package]. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly-accurate [https://postgis.net/docs/ST_Distance.html PostGIS spatial functions] in PostgreSQL. Furthermore, the code caches the results so, provided the distances fit into (high-speed) memory, it also allows users to increase the maximum feasible scale by around an order of magnitude. The code in hierarchy_InsertSnippets.py contains two snippets. The first snippet should be inserted at line 188 of the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734. A full copy of the amended [[:File:hierarchy_py.pdf|hierarchy.py]] is also available.

The results of the HCA.py script are loaded back into the database, which produces a dataset for analysis in Stata. The script [[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] loads this dataset and performs the HCA-regressions. The results are passed to a Python script, [[:File:Cubic_py.pdf|cubic.py]], which selects the appropriate number of agglomerations for each MSA. The results from both [[:File:AgglomerationMaxR2.do.pdf|AgglomerationMaxR2.do]] and cubic.py are then loaded back into the database, which produces a final dataset and a set of tables providing data for the maps. The analysis of the final dataset uses the Stata script [[:File:AgglomerationAnalysis.do.pdf|AgglomerationAnalysis.do]], and the maps are made using custom queries in [http://www.qgis.org QGIS].
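The toy sketch below illustrates one plausible reading of the cubic-selection step: fit a cubic to the HCA-regression R-squared values as a function of the candidate agglomeration count and take the count at the cubic's inflection point. Both the R-squared series and the selection rule are assumptions for illustration; cubic.py's actual rule may differ.

<syntaxhighlight lang="python">
# Toy sketch of the cubic-selection step; the R-squared series below is
# synthetic, and cubic.py's actual selection rule may differ.
import numpy as np

counts = np.arange(1, 21)  # candidate agglomeration counts for one MSA
rng = np.random.default_rng(0)
r2 = 1 - np.exp(-0.25 * counts) + rng.normal(0, 0.01, counts.size)

a, b, c, d = np.polyfit(counts, r2, deg=3)  # R^2 ~ a*k^3 + b*k^2 + c*k + d
k_star = -b / (3 * a)  # inflection point: 6a*k + 2b = 0 (an assumed rule)
chosen = int(round(np.clip(k_star, counts.min(), counts.max())))
print("selected agglomeration count:", chosen)
</syntaxhighlight>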
== Code ==
