This paper is published as:
 
[[Delineating Spatial Agglomerations|Egan, Edward J. and James A. Brander (2022), "New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups.", Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.]]
 
{{AcademicPaper
|Has title=Urban Start-up Agglomeration and Venture Capital Investment
|Has author=Ed Egan,Jim Brander
|Has RAs=Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim
|Has paper status=Published
}}
=New Submission=

A revised version of the paper, now co-authored with [[Jim Brander]] and based on the version 3 rebuild, was submitted to the Journal of Economic Geography. This is solely a methods paper, and is titled: '''A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups'''. The policy application would need to be written up as a separate paper.

==Acceptance==

On July 5th, 2022, the paper was accepted to the Journal of Economic Geography:
* Manuscript ID: JOEG-2020-449.R2
* Title: A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups
* Author(s): Edward J. Egan and James A. Brander
* Editor: Bill Kerr, HBS: wkerr@hbs.edu
* Abstract: This paper advances a new approach using hierarchical cluster analysis (HCA) for identifying and delineating spatial agglomerations and applies it to venture-backed startups. HCA identifies nested clusters at varying aggregation levels. We describe two methods for selecting a particular aggregation level and the associated agglomerations. The "elbow method" relies entirely on geographic information. Our preferred method, the "regression method", also uses venture capital investment data and identifies finer agglomerations, often the size of a small neighborhood. We use heat maps to illustrate how agglomerations evolve and indicate how our methods can aid in evaluating agglomeration support policies.
* Permanent link for code/data: https://www.edegan.com/wiki/Delineating_Spatial_Agglomerations

The paper is now in production. I will build a wiki page called [[Delineating_Spatial_Agglomerations]] that structures the documentation of the build process and shares code and some data or artifacts. Currently, that page redirects here.
==R&R==
Files:
* Pdf: [[File:Egan Brander (2020) - A New Method for Identifying and Delineating Spatial Agglomerations (Submitted to JEG).pdf]]
* In E:\projects\agglomeration
** Last document was Agglomeration Dec 15.docx
** Build is Version 3-6-2-2.
** SQL file is: AgglomerationVcdb4.sql

After some inquiries, we heard from Bill Kerr, the associate editor, that the paper had new reviews on Aug 11th. On Aug 23rd, we received an email titled "Journal of Economic Geography - Decision on Manuscript ID JOEG-2020-449" giving us an R&R. Overall, the R&R is very positive.

Bill's comments:
* Referees aligned on the central issue of Census places
* Too short: wants an application and suggests ("not contractual requirements"):
** Diversity within and between clusters in terms of types of VC investment (e.g., Biotech vs. ICT in Waltham)
** Patent citations made between VC-backed firms

Reviewer 1's comments (excluding minor things):
* Explain the projection (we should have said it was WGS1984)
* Starting units: suggests the MSA level. Suppose cities that are close... can we find cases?
* Identify clusters that have grown over time
* Maybe try a cluster-level analysis
* Is ruling out the first second-difference too limiting? Can a city be a cluster? (Vegas, baby? Or starting from a CMSA, probably yes in some sense.)
* Discuss cluster boundaries (they aren't hard and fast: "think of these clusters as the kernels or seeds of VC-backed startup hotspots")
Reviewer 2's comments (excluding minor things):
* Starting units. Suggests MSA.
* Explain the R2 method better. He didn't say try cluster-level, but that might be helpful to him too.
* Change language (back) to microgeographies! (or startup neighborhoods).
* Tighter connection to the lit. He gives papers to start.
* Discuss overlap of clusters (a la patent clustering). Check findings in Kerr and Kominers!!!
* Discuss counterfactuals/cause-and-effect/application etc. Show/discuss that we didn't just find office parks.
A new version, written by Jim, is in the works!

<pdf>File:JOEG1RndReviews.pdf</pdf>
=New Work=
===Notes for further improvement===
We might want to add some things in/back in. These include technical notes:
* To do the HCA, we used the AgglomerativeClustering method from the sklearn.cluster library (version 0.20.1) in python 3.7.1, with Ward linkage and connectivity set to none (a minimal sketch follows this list). This method is documented here: https://scikit-learn.org/stable/modules/clustering.html. I checked some of the early results against an implementation of Ward's method using the agnes function, available through the cluster package, in R: https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/agnes
* The data was assembled and processed in a Postgresql (version 10) database using PostGIS (version 2.4). We used World Geodetic System revision 84, known as WGS1984 (see https://en.wikipedia.org/wiki/World_Geodetic_System), as a coordinate system with an ellipsoidal earth, to calculate distances and areas (see https://postgis.net/docs/manual-2.4/using_postgis_dbmanagement.html). Shapefiles for Census Places were retrieved from the U.S. Census TIGER (Topologically Integrated Geographic Encoding and Referencing) database (see https://www.census.gov/programs-surveys/geography.html).
* The statistical analysis was done in STATA/MP version 15.
* All maps were made using QGIS v3.8.3. The base map is from Google Maps. City areas are highlighted using U.S. Census TIGER/Line Shapefiles.
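As a rough illustration of the clustering step in the first bullet, the sketch below runs sklearn's AgglomerativeClustering with Ward linkage over a handful of made-up startup coordinates and stores one labelling per possible cluster count (one per "layer"). The coordinate array and the loop are illustrative assumptions, not the project's actual pipeline, which pulled locations from the PostGIS database and handled distances there.

<pre>
# Minimal sketch of the HCA step (illustrative; not the project's actual pipeline).
# Ward-linkage agglomerative clustering over startup coordinates, one labelling per
# possible cluster count ("layer"). Using raw lon/lat with Euclidean Ward is a
# simplification; the real build computed distances/areas in PostGIS under WGS1984.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

coords = np.array([            # hypothetical startup locations (lon, lat)
    [-122.1430, 37.4419],
    [-122.1440, 37.4425],
    [-122.1605, 37.4440],
    [-122.4194, 37.7749],
    [-122.4180, 37.7760],
])

layers = {}
for k in range(1, len(coords) + 1):
    model = AgglomerativeClustering(n_clusters=k, linkage='ward', connectivity=None)
    layers[k] = model.fit_predict(coords)

for k, labels in layers.items():
    print(k, labels)
</pre>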
The methodology has other applications:
* Food deserts - one could study the agglomerations of restaurants and other food providers in urban environments.
* Airports, cement factories, banana plantations, police/fire stations, hospitals/drug stores, etc.
* We could think about commercial applications. Perhaps locating plants/facilities that are/aren't in clusters with a view to buying or selling them?
=SSRN version of the paper (uses v2 build)=
There are two 'final' papers based on the version 2 build. The one with the Houston narrative as the motivation is available from SSRN: https://papers.ssrn.com/abstract=3537162
The Management Science submission version has a more conventional front end and is as follows:
<pdf>File:AgglomerationV8-Reduced.pdf</pdf>
=Version 3 Rebuild=
===Another round of refinements===
# The elbow method has issues in its current form, so we are going to try using the elbow in the curvature (degree of concavity) instead.
# We might also try using elasticities.
# Rerun the distance calculations: avghulldisthm and avgdisthm are only computed for layers that we select with some method (like max r2). However, this table hadn't been updated for the elbow method, perhaps as well as some other methods, so some distances would have been missing (and replaced with zeros in the STATA script).
# Create and run the new max R2 layer. In this variant, we'll use "the first layer a cluster number is reached as the representative layer for that cluster number" (see the sketch after this list).
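As a toy illustration of the rule quoted in item 4, the sketch below keeps, for each cluster count, the first (lowest) layer at which that count appears. The layer_counts mapping is hypothetical; in the build these values come from the HCA layers stored in the database.

<pre>
# "First layer a cluster number is reached" rule (toy illustration).
layer_counts = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 4}   # hypothetical layer -> number of clusters

representative_layer = {}
for layer, num_clusters in sorted(layer_counts.items()):
    # setdefault keeps only the first (lowest) layer seen for each cluster count
    representative_layer.setdefault(num_clusters, layer)

print(representative_layer)   # {1: 1, 2: 2, 3: 4, 4: 6}
</pre>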
I built two new curvature-based elbow methods and associated variables: curvaturelayer and curvaturelayerrestricted. They use the method described below and are identical, except that curvaturelayerrestricted can't select layer 2 (both can't select the first and last layers, as they use central second differences).
For the example cities we have:

{| class="wikitable" style="vertical-align:bottom;"
|-
! place !! statecode !! year !! numstartups !! elbowlayer !! finallayer !! curvaturelayer
|-
| Menlo Park || CA || 2006 || 68 || 4 || 51 || 4
|-
| San Diego || CA || 2006 || 220 || 3 || 184 || 181
|-
| Campbell || CA || 2006 || 38 || 3 || 26 || 8
|-
| Charlotte || NC || 2006 || 30 || 3 || 30 || 28
|-
| Waltham || MA || 2006 || 106 || 3 || 58 || 55
|}

For these city-years, the curvaturelayer is the same as the curvaturelayerrestricted. As you can see, it is all over the place! I really don't think we can say that this method 'works' for any real value of 'works'. There's a sheet (Curvature Raw Data Examples) in ResultsV3-6.xlsx, and there are graphs for the selected cities on the sheet "Elbow Curvature Selected Cities".

====New MaxR2 Layer====

I noticed a copy-and-paste error in the do file and re-ran the existing max R2 method too, just to be sure. My process for the new method uses the code for the old chosenhullflayer variable. Key variables are:
* firstlayer - the layer at which numclusters first achieves that value
* regfirst - an indicator to select the right set of layers to run the max r2 estimation on
* chosenhullflayer - the variable that records the layer number selected using firstlayer and the max r2 method
* besthullflayer - the equivalent to besthulllayer but with the first layers instead of the lowest-highest ones
* targetnumclustersf, besthullflayerisadded, maxr2flayerflag, etc.
* '''regmaxr2f''' and '''regbestf''' - these are the dataset constraints to use. Everything is pushed through the database and back to generate them.

The results for our sample cities are as follows:

{| class="wikitable"
|-
! place !! statecode !! year !! finallayer !! chosenhulllayer !! chosenhullflayer !! elbowlayer
|-
| Campbell || CA || 2006 || 26 || 15 || '''3''' || '''3'''
|-
| Charlotte || NC || 2006 || 30 || 14 || '''3''' || '''3'''
|-
| Menlo Park || CA || 2006 || 51 || 33 || '''21''' || '''4'''
|-
| San Diego || CA || 2006 || 184 || 141 || '''12''' || '''3'''
|-
| Waltham || MA || 2006 || 58 || 31 || '''3''' || '''3'''
|}

I built the max R2 graphs in the sheet '''New MaxR2''' in ResultsV3-6.xlsx.

====Jim's notes on the curvature====

Suppose we have a function f. Then what I have been calling the curvature is -f’’/f’. If f is a utility function, this is the coefficient of absolute risk aversion, and it has quite often been called curvature in that context. However, in differential geometry curvature is described differently, although it is quite similar. Mas-Colell and others have suggested calling -f’’/f’ the “degree of concavity” instead. I came across this definition on the internet:

:“The degree of concavity is measured by the proportionate rate of decrease of the slope, that is, the rate at which the slope decreases divided by the slope itself.”

The general rationale for using this measure is that it is invariant to scale, whereas the straight second derivative, f’’, is not invariant. The same applies to the second difference, of course.

So our measure is the second difference divided by the first difference. However, it is not clear whether we should divide by the initial first difference or the second first difference or the average. I initially assumed that we should use the initial first difference. I now think that is wrong, as it can produce anomalies.
I think we should use the second (or “current”) first difference as the base. Here is some data I sent before:

{| class="wikitable" style="text-align:right;"
|-
! Layer !! SSR !! D1 !! D2 !! Concavity (base D1<sub>l</sub>) !! Concavity (base D1<sub>l+1</sub>)
|-
| 1 || 0 || || || ||
|-
| 2 || 40 || 40 || -5 || 0.13 || 0.14
|-
| 3 || 75 || 35 || style="background-color:#FF0;" | -20 || 0.57 || 1.33
|-
| 4 || 90 || 15 || -12 || style="background-color:#FF0;" | 0.80 || style="background-color:#FF0;" | 4
|-
| 5 || 93 || 3 || -1 || 0.33 || 0.5
|-
| 6 || 95 || 2 || -1 || 0.50 || 1
|-
| 7 || 96 || 1 || -1 || 1.00 ||
|}

The column at the far right uses the second first difference as the base, which I now think is correct. The column second from the right uses the first first difference as the base. Just to be clear, for layer 2 the first difference is 40 – 0 = 40. For layer 3 the first difference is 75 – 40 = 35. Therefore, for layer 2, the second difference is 35 – 40 = -5. I think this is what you would call the “middle second difference”. It tells how sharply the slope falls after the current layer, which is what we want. To correct for scaling, we need to divide by a first difference. In the first concavity column, for layer 2 I use 5/40 = 0.125. For the last column, for layer 2 I use 5/35 = 0.143. Both approaches have a local max at layer 4, which is what we want. However, the second column from the right has a global max at the last layer, which is certainly not what we want. But it can happen at the end, where the increments are very small. So it seems pretty clear that we want to use the second first difference as the base. More precisely, to get the concavity for layer 3 we want to divide the middle second difference by the forward first difference. (It would probably also be okay to use the middle second difference divided by the middle first difference, but I have not checked that out.)

=====Formalizing Jim's Notes=====

Jim calculates the following (examples using layer 2):
* The '''first-order backward difference''' in column '''D1''': <math>f(x)-f(x-1) = 40-0=40</math>
* The '''second-order central difference''' in column '''D2''': <math>f(x+1)-2f(x)+f(x-1) = 75-2\times 40+0 = -5</math>
* '''Concavity''' (in col 5) as <math>-D2_l/D1_l</math>, or -1 times the central second difference over the backward first difference: <math>-(-5)/40 = 0.125 \approx 0.13</math>
* '''Concavity''' (in col 6) as <math>-D2_l/D1_{l+1}</math>, or -1 times the central second difference over the forward first difference: <math>-(-5)/35 = 0.143 \approx 0.14</math>

The concavity measure in col 6 is therefore -1 times the central second difference divided by the forward first difference. (A central first difference isn't computable for a step of 1, and would give a weird answer anyway, as it straddles the observation in question.) The central second difference isn't defined for either the first or last layer, and the backward first difference isn't defined for the first layer.
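As a numerical check, the sketch below (an illustration, not the STATA code) recomputes D1, D2 and the two concavity variants for the interior layers of the SSR series above, and then picks the layer with the largest forward-based concavity; the first and last layers drop out because their differences are undefined, consistent with the specification in the Curvature section below.

<pre>
# Recompute Jim's example for the interior layers (2..6); the table's layer-7 row
# needs a data point beyond the series shown here, so it is omitted.
ssr = [0, 40, 75, 90, 93, 95, 96]   # SSR by layer (layers 1..7), from the table above
n = len(ssr)

d1 = [None] + [ssr[i] - ssr[i - 1] for i in range(1, n)]          # backward first difference
d2 = ([None] + [ssr[i + 1] - 2 * ssr[i] + ssr[i - 1] for i in range(1, n - 1)]
      + [None])                                                   # central second difference

conc_backward = [-d2[i] / d1[i] if d2[i] is not None else None
                 for i in range(n)]                               # -D2_l / D1_l
conc_forward = [-d2[i] / d1[i + 1] if d2[i] is not None and i + 1 < n else None
                for i in range(n)]                                # -D2_l / D1_(l+1)

# Layer selection: largest forward-based concavity (first/last layers excluded).
candidates = {i + 1: c for i, c in enumerate(conc_forward) if c is not None}
print(max(candidates, key=candidates.get))   # -> 4, matching the highlighted row above
</pre>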
It seems likely that we don't want the last layer and might get it because D1 is small and drives the ratio. We could instead use the forward first difference - this isn't available for the last observation (for which we can't compute a central second difference anyway) but is available for the first observation - and increment the answer, much as Jim proposes decrementing it when using the backward layer. But seeing as we can't use the first observation, we've gained nothing anyway! So we'll do Jim's method verbatim, and declare the result null if it comes out as either the first or last layer.

====Curvature====

{{Colored box|title=Specification|content=For layer <math>l</math>, I compute the curvature as -1 times the central second difference in the variance explained ratio at layer <math>l</math> divided by the backward first difference in the variance explained ratio from layer <math>l+1</math>. The first and last layers are forbidden results.}}

The curvature results seem somewhat better than the elbow results but are still far from ideal. Here are some things I look for and/or don't like in a layer selection method:
* Interior solutions are good; collapsing to the bounds, especially the lower bound, is bad
* Stable interior solutions are better - when the results approximate a quadratic, so that margins generally decrease and then increase around a maximum, the interior results are stable and that's very desirable
* Consistent solutions within cities are good - it's nice when adjacent years in the same city have more or less the same layer selected
* Consistent solutions across cities are also good - when the method picks roughly similar layer indices (i.e., % unclustered) across cities, particularly conceptually similar cities, that's a plus
* From other analysis, I know that the equilibrium of agglomeration forces occurs when agglomerations have fairly small average hull sizes, perhaps on the order of 10hm2.

===Version 3.5 build notes===

In the process of building version 3.5, I noticed a discrepancy between tothulldensity and avghulldensity. This turned out to be correct. Both are measured in startups/hm2. Tothulldensity is the sum of the number of startups in hulls divided by the total hull area, whereas avghulldensity is the average of the hull densities (each computed as the number of startups in the hull divided by the hull area).

The revised script and dataset are v3-5. ResultsV3-5.xlsx has all of the old redundant results removed and has new sheets for Descriptives (copied over with renamed column names from Descriptives.xlsx, which is generated by the .do file), as well as for the new scatterplot. Its Bar and Whisker sheet is also stripped down to the bare essentials.

===Heuristic Layer===

[[File:AgglomerationInflectionScatterPlotAllDataCircles.png|500px|right]]

I had previously calculated the heuristic layer by calculating the mean fracinhull (i.e., % of startups in economic clusters) for each percentage of the layer index (i.e., for 101 observations) and then fitting a cubic to it. I did this because Excel can't handle fitting a cubic to the full data (i.e., all 148,556 city-year-layers). However, it is incorrect because of orthogonality issues in calculating mean square distances (I'm also unsure that the mean would be the best measure of central tendency). So I redid the plot using all the data, and calculated the cubic in STATA instead. See: '''inflection.do''' and '''inflection.log'''.
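For reference, the inflection-point calculation itself is a one-liner once the cubic is fitted. The sketch below is an illustration, not inflection.do: it fits a cubic with numpy to a hypothetical stand-in for the fracinhull data and takes the zero of the second derivative.

<pre>
# Illustrative heuristic-layer calculation: fit y = a*x^3 + b*x^2 + c*x + d and take
# the inflection point, i.e. the zero of the second derivative 6*a*x + 2*b.
import numpy as np

x = np.linspace(0, 1, 101)                       # layer index as a fraction
y = 2.7 * x**3 - 4.0 * x**2 + 0.33 * x + 0.96    # hypothetical stand-in for fracinhull

a, b, c, d = np.polyfit(x, y, 3)                 # coefficients, highest degree first
print(-b / (3 * a))                              # inflection point of the fitted cubic

# With the coefficients reported below (a = 2.737323, b = -4.005114),
# -b/(3a) = 4.005114 / 8.211969 = 0.4877, matching x = 0.487717.
</pre>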
The old result is in [https://www.edegan.com/wiki/Urban_Start-up_Agglomeration_and_Venture_Capital_Investment#Fixing_an_issue Fixing an issue] below, and is x≈0.483879. The corrected result is x≈0.487717 (note that R2 has dropped to 92.43%):

:2.737323 x^3 - 4.005114 x^2 + 0.3262405 x + 0.9575088 ≈ 0.481497 at x≈0.487717 [https://www.wolframalpha.com/input/?i=inflection+points+2.737323+x3+-+4.005114x2++%2B+0.3262405x+%2B+0.9575088]

I also calculated an '''inflectionlayer''' (as opposed to the heurflhlayer, where flh stands for fraction of locations in hulls, described above). This inflectionlayer is '''the first time''' that the second central difference in the '''share of startups in economic clusters''' switches sign. It is only possible to calculate this when there are at least 4 data points, as the central difference requires data from layer-1, layer and layer+1, and we need two central differences. The variable is included in dataset (and .do files, etc.) version 3-4 forwards.

However, the inflectionlayer is really meaningless. The sign of the second central difference switches back and forward due to integer effects, and I can't find a straightforward algorithm to pick the "correct" candidate from the set of results. Picking the '''first one''', which I currently pick, is completely arbitrary. There are a bunch of examples of the curves and the issue(s) in Results3-4.xlsx, sheet 'Inflection'. I expect that if I put a bunch of time into this I could come up with some change thresholds to rule candidate answers in or out, but even then this isn't a good method.

Ultimately, the individual city-year inflection curves (i.e., across layers within a city-year) are just way too noisy. A variant of this noise problem is what makes the elbow method so problematic, but the noise is even worse with the inflection method. Using the heuristic result above (i.e., the one using all city-years) solves this noise problem by aggregating city-years together.

One complaint made about the heuristic results is that it is near the middle (i.e., it's 48.7717%, which happens to be near 50%). Although the nature of any HCA on geographic coords implies that the result is unlikely to be close to the bounds (0 or 100%) and more likely to be near the middle (50%), it could be in an entirely different place. '''This result (i.e., the heuristic layer at 48.7717%) characterizes the agglomeration of venture-backed startup firms'''. You'd get a very different number if you studied gas stations, supermarkets, airports, or banana plantations!

====Comparing the Heuristic and R2 Layers====

{{Colored box|title=The Case for the Heuristic Method|content=The heuristic method (i.e., using the inflection in the plot from the population of city-year-layers) finds pretty much the same layer as the R2 method with almost no work, and it can be used in a within-city analysis without having to hold hull count constant.}}
<pre>
. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50
> mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
-------------+--------------------------------------------------------------------------------
      nohull |         2  3.531407   7.07922      2977         1        68         1         6
tothullcount |         8   17.4565  35.65118      2977         3       380         3        30
 tothullarea |  14.76523   448.029  2063.824      2977  .0049029  34780.04  .5275311  732.4005
tothullden~y |  .7640136  11.32988  63.62256      2977  .0002282  1425.338  .0115537  16.15439
 growthinv18 |  33.53101     142.5  561.6696      2977         0   22282.6   1.53118  309.0208
    numdeals |         3   6.71347  17.06682      2977         0       275         0        15
 numstartups |        16  41.28955  89.98027      2977         6      1317         7        90
----------------------------------------------------------------------------------------------

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1, stats(p50
> mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
-------------+--------------------------------------------------------------------------------
      nohull |         2  4.279958  8.433203      3797         0       119         1         9
tothullcount |         8  20.08954  42.99372      3797         0       673         3        43
 tothullarea |  11.32983  49.42803  158.7375      3797         0  2569.169  1.660208  93.94627
tothullden~y |   .946713   3.48483  10.93185      3797         0  212.8198    .06182  7.601018
 growthinv18 |   31.8453  133.0608  508.1196      3797         0   22282.6  1.235763  292.4397
    numdeals |         2  6.629181  16.46614      3797         0       275         0        15
 numstartups |        15  38.74743   83.6814      3797         6      1317         7        83
----------------------------------------------------------------------------------------------
</pre>

Analyzing layers:

{| class="wikitable"
|-
! Method !! Avg. Layer Index !! Std. Dev. Layer Index
|-
| Max R2 || 0.392473192 || 0.2380288695
|-
| Heuristic || 0.43423652 || 0.0495630531
|}

'''The Max R2 and Heuristic layers are identical in 12.6% of cases!''' Some of these cases are found in city-years with a large number of layers; for instance, there are 90 city-years that have more than 20 startups and identical heuristic and max R2 layers. The table below shows city-years with more than 50 startups and identical heuristic and max R2 layers:

{| class="wikitable"
|-
! place !! statecode !! year !! numstartups !! chosenhulllayer !! heurflhlayer
|-
| San Francisco || CA || 2009 || 503 || 175 || 175
|-
| Los Angeles || CA || 2012 || 213 || 93 || 93
|-
| Redwood City || CA || 2012 || 151 || 49 || 49
|-
| Redwood City || CA || 2013 || 151 || 49 || 49
|-
| Seattle || WA || 2000 || 113 || 48 || 48
|-
| Houston || TX || 2007 || 92 || 40 || 40
|-
| Waltham || MA || 2012 || 73 || 24 || 24
|-
| Pittsburgh || PA || 2008 || 70 || 25 || 25
|-
| Bellevue || WA || 2001 || 64 || 25 || 25
|-
| Bellevue || WA || 2003 || 61 || 23 || 23
|-
| Pleasanton || CA || 2004 || 54 || 20 || 20
|-
| Menlo Park || CA || 2004 || 52 || 22 || 22
|-
| Durham || NC || 2009 || 50 || 22 || 22
|}

In fact, 84% of city-years (which have both heuristic and max R2 layers) have heuristic and max R2 layers that are separated by less than or equal to 5 layers, and 59% have them separated by less than or equal to 2 layers! '''More than a third (36.3%) of city-years have their heuristic and max R2 layers separated by less than or equal to 1 layer.'''

===Another list of items===

Jim asked for the following (in order of delivery schedule, not importance):
# A dataset and STATA do file to implement table 5, complete with an exploration of which regressors to include
# An implementation of the 'real elbow method', then integration with (1).
# A (set of) comparison(s) between the max R2 method and the elbow methods
# A new heatmap or two, based on a different location.

All done... see the sections below.
====Heatmaps====

I built '''unbuffered heatmaps using maximum R2 layers from 1995 to 2018''' for a set of "interesting" cities. I often built the same city at multiple scales. Only the zoomed-in maps are in the gallery below. I can now quite quickly build more cities if needed. It is worth noting the following:
* Because we are using unbuffered hulls, heatmap components are angular and non-diffuse.
* Agglomerations are smaller in cities with higher startup counts but are small everywhere.
* Agglomerations don't come close to overlapping city boundaries. Agglomerations within Palo Alto don't overflow into Mountain View, and it isn't meaningful to talk about Boston-Cambridge agglomerations, except as a broad set. An agglomeration is typically a few square blocks (we knew this from the mean and median hull sizes).
* Some famous policy interventions appear to have no effect. There is no agglomeration, let alone a concentration of them, in Boston's North End, where hundreds of millions were plowed into a TIF (and MassChallenge).

<gallery widths=300 heights=300>
File:Bellevue125000MaxR2UnbufferedHeatmap.png| Bellevue, WA, 1:125k
File:PaloAlto50000MaxR2UnbufferedHeatmap.png| Palo Alto, CA, 1:50k
File:Boulder50000MaxR2UnbufferedHeatmap.png| Boulder, CO, 1:50k
File:Waltham65000MaxR2UnbufferedHeatmap.png| Waltham, MA, 1:65k
File:Boston50000MaxR2UnbufferedHeatmap.png| Boston, MA, 1:50k
</gallery>

I also built three buffered heatmaps of Boston as a proof of concept. I used either the average distance between the points on the edge of the hull and the centroid, or half of it, as the buffering distance. I also varied the intensity of the shading (down to 10% per layer from 20% in the 1:70000 image). Boston should have 17 agglomerations according to the maximum R2 method, so the half-distance buffer might be best for picking them out.

<gallery widths=300 heights=300>
File:Boston70000MaxR2buffered1xHeatmap.png| Boston, MA, 1:70k, 1x buffer, 10% opacity
File:Boston50000MaxR2bufferedHalfxHeatmap.png| Boston, MA, 1:50k, 0.5x buffer, 20% opacity
File:Boston50000MaxR2buffered1xHeatmap.png| Boston, MA, 1:50k, 1x buffer, 20% opacity
</gallery>
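A minimal sketch of the buffering rule described above, using shapely (an assumption on my part; the actual maps were drawn in QGIS): buffer each hull by the average distance from its boundary points to its centroid, or by half that distance.

<pre>
# Buffer a (hypothetical) convex hull by the average edge-point-to-centroid distance,
# or half of it, as described above. Coordinates are in projected units.
from shapely.geometry import Point, Polygon

hull = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])      # hypothetical hull

centroid = hull.centroid
edge_points = list(hull.exterior.coords)[:-1]          # drop the repeated closing vertex
avg_dist = sum(Point(p).distance(centroid) for p in edge_points) / len(edge_points)

buffered_full = hull.buffer(avg_dist)                  # 1x buffer
buffered_half = hull.buffer(avg_dist / 2)              # 0.5x buffer
print(round(avg_dist, 3), round(buffered_full.area, 1), round(buffered_half.area, 1))
</pre>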
====Comparing the Methods====

Summaries of the meta-data on the geometries created by each lens are probably the best method of comparison. These are in the do file:

<pre>
. //Compare how their lenses look:

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50
> mean sd N min max p10 p90) columns(statistics)
</pre>
*Component 3 is driven by the '''total hull area'''
=Previous Version 2 Build=
==Target Journal==
=Old Work Using Circles=
 
See: [[Enclosing Circle Algorithm]]
==Very Old Summary==
