This paper is published as: [[Delineating Spatial Agglomerations|Egan, Edward J. and James A. Brander (2022), "A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups.", Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.]]

{{AcademicPaper|Has title=Urban Start-up Agglomeration and Venture Capital Investment|Has author=Ed Egan,Jim Brander|Has RAs=Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim|Has paper status=Published}}

=Working Paper=

==New Submission==

A revised version of the paper, now co-authored with [[Jim Brander]] and based on the version 3 rebuild, was submitted to the Journal of Economic Geography. This is solely a methods paper, titled '''A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups'''. The last policy application would need to be written up as a separate paper.

==Acceptance==

On July 5th, 2022, the paper was accepted to the Journal of Economic Geography:
* Manuscript ID: JOEG-2020-449.R2
* Title: A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups
* Author(s): Edward J. Egan and James A. Brander
* Editor: Bill Kerr, HBS: wkerr@hbs.edu
* Abstract: This paper advances a new approach using hierarchical cluster analysis (HCA) for identifying and delineating spatial agglomerations and applies it to venture-backed startups. HCA identifies nested clusters at varying aggregation levels. We describe two methods for selecting a particular aggregation level and the associated agglomerations. The “elbow method” relies entirely on geographic information. Our preferred method, the “regression method”, also uses venture capital investment data and identifies finer agglomerations, often the size of a small neighborhood. We use heat maps to illustrate how agglomerations evolve and indicate how our methods can aid in evaluating agglomeration support policies.
* Permanent link for code/data: https://www.edegan.com/wiki/Delineating_Spatial_Agglomerations

The paper is now in production. I will build a wiki page called [[Delineating_Spatial_Agglomerations]] that structures the documentation of the build process and shares code and some data or artifacts. Currently, that page redirects here.

== R&R ==

Files:
* Pdf: [[File:Egan Brander (2020) - A New Method for Identifying and Delineating Spatial Agglomerations (Submitted to JEG).pdf]]
* In E:\projects\agglomeration
** Last document was Agglomeration Dec 15.docx
** Build is Version 3-6-2-2.
** SQL file is: AgglomerationVcdb4.sql

After some inquiries, we heard from Bill Kerr, the associate editor, that the paper had new reviews on Aug 11th. On Aug 23rd, we received an email titled "Journal of Economic Geography - Decision on Manuscript ID JOEG-2020-449" giving us an R&R. Overall, the R&R is very positive.

Bill's comments:
* Referees aligned on the central issue of Census places
* Too short: wants an application and suggests ("not contractual requirements"):
** Diversity within and between clusters in terms of types of VC investment (e.g., Biotech vs. ICT in Waltham)
** Patent citations made between VC-backed firms

Reviewer 1's comments (excluding minor things):
* Explain the projection (should have said it was WGS1984)
* Starting units: suggests the MSA level. Suppose cities that are close... can we find cases?
* Identify clusters that have grown over time
* Maybe try a cluster-level analysis
* Is ruling out the first second-difference too limiting? Can a city be a cluster? (Vegas, baby? Or, starting from a CMSA, probably yes in some sense.)
* Discuss cluster boundaries (they aren't hard and fast: "think of these clusters as the kernels or seeds of VC-backed startup hotspots")

Reviewer 2's comments (excluding minor things):
* Starting units: suggests the MSA.
* Explain the R2 method better. He didn't say to try a cluster-level analysis, but that might be helpful to him too.
* Change the language (back) to microgeographies! (or startup neighborhoods)
* Tighter connection to the literature. He gives papers to start from.
* Discuss overlap of clusters (a la patent clustering). Check findings in Kerr and Kominers!!!
* Discuss counterfactuals/cause-and-effect/application, etc. Show/discuss that we didn't just find office parks.

<pdf>File:JOEG1RndReviews.pdf</pdf>

===Notes for further improvement===

We might want to add some things in/back in. These include technical notes:
*To do the HCA, we used the AgglomerativeClustering method from the sklearn.cluster library (version 0.20.1) in python 3.7.1, with Ward linkage and connectivity set to none (a minimal sketch follows this list). This method is documented here: https://scikit-learn.org/stable/modules/clustering.html. I checked some of the early results against an implementation of Ward's method using the agnes function, available through the cluster package, in R: https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/agnes
*The data was assembled and processed in a PostgreSQL (version 10) database using PostGIS (version 2.4). We used World Geodetic System revision 84, known as WGS1984 (see https://en.wikipedia.org/wiki/World_Geodetic_System), as a coordinate system with an ellipsoidal earth, to calculate distances and areas (see https://postgis.net/docs/manual-2.4/using_postgis_dbmanagement.html). Shapefiles for Census Places were retrieved from the U.S. Census TIGER (Topologically Integrated Geographic Encoding and Referencing) database (see https://www.census.gov/programs-surveys/geography.html).
*The statistical analysis was done in STATA/MP version 15.
*All maps were made using QGIS v3.8.3. The base map is from Google Maps. City areas are highlighted using U.S. Census TIGER/Line Shapefiles.
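For reference, the following is a minimal sketch of that clustering step, assuming a handful of hypothetical projected startup coordinates for one city-year. Note that sklearn's Ward linkage clusters on Euclidean coordinates, so this stands in for, rather than reproduces, the WGS1984 ellipsoidal distance work done in PostGIS.

<syntaxhighlight lang="python">
# Minimal sketch of the HCA step described above (not the full build pipeline).
# `coords` is a hypothetical set of projected startup locations for one city-year.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

coords = np.array([
    [565100.0, 4184300.0],  # hypothetical projected x/y coordinates, in meters
    [565250.0, 4184410.0],
    [567800.0, 4186020.0],
    [568010.0, 4186100.0],
])

# Ward linkage with connectivity=None, as in the notes above. Each value of
# n_clusters corresponds to one aggregation level ("layer") of the hierarchy.
for n_clusters in range(1, len(coords) + 1):
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward", connectivity=None
    ).fit_predict(coords)
    print(n_clusters, labels.tolist())
</syntaxhighlight>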
The methodology has other applications:
*Food deserts: one could study agglomerations of restaurants and other food providers in urban environments.
*Airports, cement factories, banana plantations, police/fire stations, hospitals/drug stores, etc.
*We could think about commercial applications. Perhaps locating plants/facilities that are/aren't in clusters with a view to buying or selling them?

=SSRN version of the paper (uses v2 build)=

There are two 'final' papers based on the version 2 build. The one with the Houston narrative as its motivation is available from SSRN: https://papers.ssrn.com/abstract=3537162

The Management Science submission version has a more conventional front end and is as follows:

<pdf>File:AgglomerationV8-Reduced.pdf</pdf>

A new version, written by Jim, is in the works!

=New Work=

==Version 3 Rebuild==

===Another round of refinements===

#The elbow method has issues in its current form, so we are going to try using the elbow in the curvature (degree of concavity) instead (see the sketch after this list).
#We might also try using elasticities...
#Rerun the distance calculations -- avghulldisthm and avgdisthm are only computed for layers that we select with some method (like max R2). However, this table hadn't been updated for the elbow method, and perhaps some other methods as well, so some distances would have been missing (and replaced with zeros in the STATA script).
#Create and run the new max R2 layer. In this variant, we'll use "the first layer a cluster number is reached as the representative layer for that cluster number".
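As a point of comparison for item 1, here is a hedged sketch of a plain second-difference elbow pick. The rule shown (take the interior layer with the largest central second difference of the SSR series) is my paraphrase, and the toy series is made up; it is not necessarily the paper's exact elbow definition.

<syntaxhighlight lang="python">
# Hedged sketch of a plain second-difference elbow pick (my paraphrase,
# not necessarily the paper's exact elbow rule). ssr[0] is layer 1.
import numpy as np

def elbow_layer(ssr):
    ssr = np.asarray(ssr, dtype=float)
    # Central second differences exist only for interior layers 2..n-1,
    # so the first and last layers can never be selected.
    d2 = ssr[2:] - 2.0 * ssr[1:-1] + ssr[:-2]
    return int(np.argmax(d2)) + 2  # map array index back to a 1-indexed layer

print(elbow_layer([0.0, 2.0, 5.0, 11.0, 30.0]))  # toy series -> layer 4
</syntaxhighlight>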
I built two new curvature-based elbow methods, and so two new variables: curvaturelayer and curvaturelayerrestricted. They use the method described below and are identical, except that curvaturelayerrestricted can't select layer 2 (and both can't select the first and last layers, as they use central second differences). For the example cities we have:

{| class="wikitable" style="vertical-align:bottom;"
|-
! place
! statecode
! year
! numstartups
! elbowlayer
! finallayer
! curvaturelayer
|-
| Menlo Park
| CA
| 2006
| 68
| 4
| 51
| 4
|-
| San Diego
| CA
| 2006
| 220
| 3
| 184
| 181
|-
| Campbell
| CA
| 2006
| 38
| 3
| 26
| 8
|-
| Charlotte
| NC
| 2006
| 30
| 3
| 30
| 28
|-
| Waltham
| MA
| 2006
| 106
| 3
| 58
| 55
|}

For these city-years, curvaturelayer is the same as curvaturelayerrestricted. As you can see, it is all over the place! I really don't think we can say that this method 'works' for any real value of 'works'. There's a sheet (Curvature Raw Data Examples) in ResultsV3-6.xlsx, and there are graphs for the selected cities on the sheet "Elbow Curvature Selected Cities".

====New MaxR2 Layer====

I noticed a copy-and-paste error in the do file, so I re-ran the existing max R2 method too, just to be sure. My process for the new method uses the code for the old chosenhulllayer variable. Key variables are:
*firstlayer - the layer at which numclusters first achieves a given value
*regfirst - an indicator to select the right set of layers to run the max R2 estimation on
*chosenhullflayer - the variable that records the layer number selected using firstlayer and the max R2 method
*besthullflayer - the equivalent of besthulllayer but with the first layers instead of the lowest-highest ones
*targetnumclustersf, besthullflayerisadded, maxr2flayerflag, etc.
*'''regmaxr2f''' and '''regbestf''' - these are the dataset constraints to use. Everything is pushed through the database and back to generate them.

The results for our sample cities are as follows:

{| class="wikitable" style="vertical-align:bottom;"
|-
! place
! statecode
! year
! finallayer
! chosenhulllayer
! style="font-weight:bold;" | chosenhullflayer
! elbowlayer
|-
| Campbell
| CA
| 2006
| 26
| 15
| style="font-weight:bold;" | 3
| style="font-weight:bold;" | 3
|-
| Charlotte
| NC
| 2006
| 30
| 14
| style="font-weight:bold;" | 3
| style="font-weight:bold;" | 3
|-
| Menlo Park
| CA
| 2006
| 51
| 33
| style="font-weight:bold;" | 21
| style="font-weight:bold;" | 4
|-
| San Diego
| CA
| 2006
| 184
| 141
| style="font-weight:bold;" | 12
| style="font-weight:bold;" | 3
|-
| Waltham
| MA
| 2006
| 58
| 31
| style="font-weight:bold;" | 3
| style="font-weight:bold;" | 3
|}

I built the max R2 graphs in the sheet '''New MaxR2''' in ResultsV3-6.xlsx. A sketch of the selection logic follows.
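This is a hedged sketch of the max R2 selection logic, not the actual STATA/SQL implementation: fit the same regression at each candidate (first) layer and keep the layer whose fit has the highest R2. The function and variable names here are placeholders.

<syntaxhighlight lang="python">
# Hedged sketch of max-R2 layer selection (placeholder feature sets; the
# real specification lives in the STATA do file and the database). For each
# candidate layer, fit the same regression and keep the highest-R^2 layer.
import numpy as np
from sklearn.linear_model import LinearRegression

def max_r2_layer(features_by_layer, y):
    """features_by_layer: dict mapping layer number -> 2D feature matrix X
    (e.g., cluster counts and hull areas); y: outcome such as log VC investment."""
    best_layer, best_r2 = None, -np.inf
    for layer, X in sorted(features_by_layer.items()):
        r2 = LinearRegression().fit(X, y).score(X, y)  # R^2 of this layer's fit
        if r2 > best_r2:
            best_layer, best_r2 = layer, r2
    return best_layer, best_r2
</syntaxhighlight>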
====Jim's notes on the curvature====

Suppose we have a function f. Then what I have been calling the curvature is -f’’/f’. If f is a utility function, this is the coefficient of absolute risk aversion, and it has quite often been called curvature in that context. However, in differential geometry, curvature is described differently, although it is quite similar. Mas-Colell and others have suggested calling -f’’/f’ the “degree of concavity” instead.

I came across this definition on the internet:

:“The degree of concavity is measured by the proportionate rate of decrease of the slope, that is, the rate at which the slope decreases divided by the slope itself.”

The general rationale for using this measure is that it is invariant to scale, whereas the straight second derivative, f’’, is not invariant. The same applies to the second difference, of course.

So our measure is the second difference divided by the first difference. However, it is not clear whether we should divide by the initial first difference or the second first difference or the average. I initially assumed that we should use the initial first difference. I now think that is wrong, as it can produce anomalies. I think we should use the second (or “current”) first difference as the base.

Here is some data I sent before:

{| class="wikitable" style="text-align:right;"
|-
! Layer
! SSR
! D1
! D2
! Concavity (col5)
! Concavity (col6)
|}

*'''Concavity''' (in col6) is <math>-D2_l / D1_{l+1}</math>, i.e., -1 times the central second difference divided by the current first difference: for example, <math>-(-5)/35 \approx 0.14</math>.

The concavity measure in col6 is therefore -1 times the central second difference divided by the current first difference; a central first difference isn't computable for a step of 1 (and gives a weird answer anyway, as it straddles the observation in question). The central second difference isn't defined for either the first or last layer, and the backward first difference isn't defined for the first layer. It seems likely that we don't want the last layer, and we might get it because D1 is small and drives the ratio. We could instead use the forward first difference -- this isn't available for the last observation (for which we can't compute a central second difference anyway) but is available for the first observation -- and increment the answer, much as Jim proposes decrementing it when using the backward layer. But seeing as we can't use the first observation, we've gained nothing anyway! So we'll do Jim's method verbatim, and declare the result null if it comes out as either the first or last layer.
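To make the base-choice issue concrete, here is a small worked sketch. The second difference of -5 and the current first difference of 35 come from the example above; the initial first difference of 11.6 is a hypothetical value chosen for contrast.

<syntaxhighlight lang="python">
# Worked sketch of the concavity measure -D2/D1 and the choice of base.
# D2 = -5 and the current first difference 35 come from the example above;
# the initial first difference 11.6 is a hypothetical value for contrast.
def concavity(d2, d1_base):
    return -d2 / d1_base

d2 = -5.0
d1_current = 35.0  # "current" (second) first difference, Jim's preferred base
d1_initial = 11.6  # hypothetical initial first difference

print(round(concavity(d2, d1_current), 2))  # 0.14
print(round(concavity(d2, d1_initial), 2))  # 0.43
</syntaxhighlight>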
====Curvature====

{{Colored box|title=Specification|content=For layer <math>l</math>, I compute the curvature as -1 times the central second difference in the variance explained ratio at layer <math>l</math>, divided by the backward first difference in the variance explained ratio at layer <math>l+1</math>. The first and last layers are forbidden results.}}

A sketch implementing this specification appears after the list below. The curvature results seem somewhat better than the elbow results, but are still far from ideal. Here are some things I look for and/or don't like in a layer selection method:
*Interior solutions are good; collapsing to the bounds, especially the lower bound, is bad.
*Stable interior solutions are better: when the results approximate a quadratic, so that margins generally decrease and then increase around a maximum, the interior results are stable, and that's very desirable.
*Consistent solutions within cities are good: it's nice when adjacent years in the same city have more or less the same layer selected.
*Consistent solutions across cities are also good: when the method picks roughly similar layer indices (i.e., % unclustered) across cities, particularly conceptually similar cities, that's a plus.
*From other analysis, I know that the equilibrium of agglomeration forces occurs when agglomerations have fairly small average hull sizes, perhaps on the order of 10hm2.
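Below is a minimal sketch implementing the specification, assuming the selected layer is the one maximizing this curvature over the computable interior layers; the variable name ve (the variance explained ratio by layer) and the toy series are mine, not the build's.

<syntaxhighlight lang="python">
# Sketch of the Specification above: curvature(l) = -D2_l / D1_{l+1}, with
# the first and last layers forbidden by construction. Assumes the chosen
# layer maximizes curvature and that ve is strictly increasing (no zero D1);
# ve[0] holds layer 1's variance explained ratio.
import numpy as np

def curvature_layer(ve, restricted=False):
    ve = np.asarray(ve, dtype=float)
    d2 = ve[2:] - 2.0 * ve[1:-1] + ve[:-2]  # central 2nd difference, layers 2..n-1
    d1_next = ve[2:] - ve[1:-1]             # backward 1st difference at layer l+1
    curv = -d2 / d1_next                    # curvature, defined for layers 2..n-1 only
    start = 1 if restricted else 0          # curvaturelayerrestricted also bars layer 2
    best = start + int(np.argmax(curv[start:]))
    return best + 2                         # map array index back to a 1-indexed layer

print(curvature_layer([0.0, 0.35, 0.55, 0.68, 0.75, 0.78]))  # toy curve -> layer 5
</syntaxhighlight>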
===Version 3.5 build notes===
*Component 3 is driven by the '''total hull area'''
=Previous Version 2 Build=
==Target Journal==
=Old Work Using Circles=
 
See: [[Enclosing Circle Algorithm]]
==Very Old Summary==
