Urban Start-up Agglomeration and Venture Capital Investment

From edegan.com
Jump to navigation Jump to search
Academic Paper
Title Urban Start-up Agglomeration and Venture Capital Investment
Author Ed Egan, Jim Brander
RAs Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim
Status Working paper
© edegan.com, 2016


Working Paper

The last version of the paper, with the Houston narrative as the motivation, is available from SSRN: https://papers.ssrn.com/abstract=3537162

The Management Science submission version has a more conventional front end and is as follows:

A new version, written by Jim, is in the works!

New Work

Version 3.5 build notes

In the process of building version 3.5, I noticed a discrepancy between tothulldensity and avghulldensity. This turned out to be correct. Both are measured in startups/hm2. Tothulldensity of the sum of the number of startups in hulls divided by the total hull area, whereas the avghulldensity is the average of the hull densities (computed as the number of startups in the hull divided by the hull area).

The revised script and dataset is v3-5. ResultsV3-5.xlsx has all of the old redundant results removed and has new sheets for Descriptives (copied over with renamed column names from Descriptives.xlx, which is generated by the .do file), as well as for the new scatterplot. Its Bar and Whisker is also stripped down to the bare essentials.

Heuristic Layer


I had previously calculated the heuristic layer by calculating the mean fracinhull (i.e., % of startups in economic clusters) for each percentage of the layer index (i.e., for 101 observations) and then fitting a cubic to it. I did this because excel can't handle fitting a cubic to the full data (i.e., all 148,556 city-year-layers). However, it is incorrect because of orthogonality issues in calculating mean square distances (I'm also unsure that the mean would be the best measure of central tendency). So I redid the plot using all the data, and calculated the cubic in STATA instead. See: inflection.do and inflection.log.

The old result is in Fixing an issue below, and is x≈0.483879. The corrected result is x≈0.487717 (note that R2 has dropped to 92.43%):

2.737323 x^3 - 4.005114 x^2 + 0.3262405 x + 0.9575088≈0.481497 at x≈0.487717 [1]

I also calculated an inflectionlayer (as opposed to the heurflhlayer, where flh stands for fraction of locations in hulls, described above). This inflectionlayer is the first time that the second central difference in the share of startups in economic clusters switches sign. It is only possible to calculate this when there are at least 4 data points, as the central difference requires data from layer-1, layer and layer+1, and we need two central differences. The variable is included in dataset (and so do files, etc.) version 3-4 forwards.

However, the inflectionlayer is really meaningless. The sign of the second central switches back and forward due to integer effects and I can't find a straight forward algorithm to pick the "correct" candidate from the set of results. Picking the first one, which I currently pick, is completely arbitrary. There are a bunch of examples of the curves and the issue(s) in Results3-4.xlsx sheet 'Inflection'. I expect that if I put a bunch of time into this I could come up with some change thresholds to rule candidate answers in or out, but even then this isn't a good method.

Ultimately, the individual city-year inflection curves (i.e., across layers within a city-year) are just way too noisy. A variant of this noise problem is what makes the elbow method so problematic, but the noise is even worse with the inflection method. Using the heuristic result above (i.e., the one using all city-years) solves this noise problem by aggregating city-years together.

One complaint made about the heuristic results is that it is near the middle (i.e., it's 48.7717%, which happens to be near 50%). Although the nature of any HCA on geographic coords implies that the result is unlikely to the close to the bounds (0 or 100%) and more likely to be near the middle (50%), it could be in an entirely different place. This result (i.e., the heuristic layer at 48.7717%) characterizes the agglomeration of venture-backed startup firms. You'd get a very different number if you studied gas stations, supermarkets, airports, or banana plantations!

Comparing the Heuristic and R2 Layers

The Case for the Heuristic Method
The heuristic method (i.e., using the inflection in the plot from the population of city-year-layers) finds pretty much the same layer as the R2 method with almost no work, and it can be used in a within-city analysis without having to hold hull count constant.
. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50
>  mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  3.531407   7.07922      2977         1        68         1         6
tothullcount |         8   17.4565  35.65118      2977         3       380         3        30
 tothullarea |  14.76523   448.029  2063.824      2977  .0049029  34780.04  .5275311  732.4005
tothullden~y |  .7640136  11.32988  63.62256      2977  .0002282  1425.338  .0115537  16.15439
 growthinv18 |  33.53101     142.5  561.6696      2977         0   22282.6   1.53118  309.0208
    numdeals |         3   6.71347  17.06682      2977         0       275         0        15
 numstartups |        16  41.28955  89.98027      2977         6      1317         7        90

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1, stats(p50
>  mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  4.279958  8.433203      3797         0       119         1         9
tothullcount |         8  20.08954  42.99372      3797         0       673         3        43
 tothullarea |  11.32983  49.42803  158.7375      3797         0  2569.169  1.660208  93.94627
tothullden~y |   .946713   3.48483  10.93185      3797         0  212.8198    .06182  7.601018
 growthinv18 |   31.8453  133.0608  508.1196      3797         0   22282.6  1.235763  292.4397
    numdeals |         2  6.629181  16.46614      3797         0       275         0        15
 numstartups |        15  38.74743   83.6814      3797         6      1317         7        83

Analyzing layers:

Method 	Avg. Layer Index	Std. Dev Layer Index
Max R2 	0.392473192     	0.2380288695	
Heuristic	0.43423652      	0.0495630531	

The Max R2 and Heuristic layers are identical in 12.6% of cases! Some of these cases are found in city-years with a large number of layers, for instance, there are 90 city-years that have more than 20 startups and identical heuristic and max r2 layers. The table below shows city-years with more than 50 startups and identical heuristic and max R2 layers:

place statecode year numstartups chosenhulllayer heurflhlayer
San Francisco CA 2,009 503 175 175
Los Angeles CA 2,012 213 93 93
Redwood City CA 2,012 151 49 49
Redwood City CA 2,013 151 49 49
Seattle WA 2,000 113 48 48
Houston TX 2,007 92 40 40
Waltham MA 2,012 73 24 24
Pittsburgh PA 2,008 70 25 25
Bellevue WA 2,001 64 25 25
Bellevue WA 2,003 61 23 23
Pleasanton CA 2,004 54 20 20
Menlo Park CA 2,004 52 22 22
Durham NC 2,009 50 22 22

In fact, 84% of city-years (which have both heuristic and max R2 layers) have heuristic and max R2 layers that are separated by less than or equal to 5 layers, and 59% have them separated by less than or equal to 2 layers! More than a third (36.3%) of city-years have their heuristic and max R2 layers separated by less than or equal to 1 layer.

Another list of items

Jim asked for the following (in order of delivery schedule, not importance):

  1. A dataset and STATA do file and to implement table 5, complete with an exploration of which regressors to include
  2. An implementation of the 'real elbow method', then integration with (1).
  3. A (set of) comparison(s) between the max R2 method and the elbow methods
  4. A new heatmap or two, based on a different location.

All done... see the sections below.


I built unbuffered heatmaps using maximum R2 layers from 1995 to 2018 for a set of "interesting" cities. I often built the same city at multiple scales. Only the zoomed-in maps are in the gallery below. I can now quite quickly build more cities if needed. It is worth noting the following:

  • Because we are using unbuffered hulls, heatmap components are angular and non-diffuse.
  • Agglomerations are smaller in cities with higher startup counts but are small everywhere.
  • Agglomerations don't come close to overlapping city boundaries. Agglomerations within Palo Alto don't overflow into Mountain View and it isn't meant meaningful to talk about Boston-Cambridge agglomerations, except as a broad set. An agglomeration is typically a few square blocks (we knew this from the mean and median hull sizes).
  • Some famous policy interventions appear to have no effect. There is no agglomeration, let alone a concentration of them, in Boston's North End, where hundreds of millions were plowed into a TIF (and MassChallenge).

I also built three buffered heatmaps of Boston as a proof of concept. I used either the average distance between the points on the edge of the hull and the centroid, or half of it, as a buffering distance. I also varied the intensity of the shading (down to 10% per layer from 20% in the 1:70000 image). Boston should have 17 agglomerations according to the maximum R2 method, so the half distance buffer might be best for picking them out.

Comparing the Methods

Summaries of the meta-data on geometries created by each lens is probably the best method of comparison. These are in the do file:

. //Compare how their lenses look:
. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50
>  mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  3.531407   7.07922      2977         1        68         1         6
tothullcount |         8   17.4565  35.65118      2977         3       380         3        30
 tothullarea |  14.76523   448.029  2063.824      2977  .0049029  34780.04  .5275311  732.4005
tothullden~y |  .7640136  11.32988  63.62256      2977  .0002282  1425.338  .0115537  16.15439
 growthinv18 |  33.53101     142.5  561.6696      2977         0   22282.6   1.53118  309.0208
    numdeals |         3   6.71347  17.06682      2977         0       275         0        15
 numstartups |        16  41.28955  89.98027      2977         6      1317         7        90

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1, stats(p50
>  mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  4.305768  8.498714      3797         0       119         1         9
tothullcount |         8  20.27153  43.51017      3797         0       675         3        43
 tothullarea |  11.43336  49.81455  159.0983      3797         0  2569.169  1.661926  94.14735
tothullden~y |  .9422739  3.452804  10.89704      3797         0  212.8198    .06182   7.47113
 growthinv18 |   31.8453  133.0608  508.1196      3797         0   22282.6  1.235763  292.4397
    numdeals |         2  6.629181  16.46614      3797         0       275         0        15
 numstartups |        15  38.74743   83.6814      3797         6      1317         7        83

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regelbow==1, stats(p50
>  mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  3.152638  3.973168      3374         0        48         0         7
tothullcount |        12  37.32721   87.6828      3374         0      1303         0        91
 tothullarea |  55.78572   898.919  3849.938      3374         0  74067.25         0  1589.324
tothullden~y |   .169715  31.79024  1726.042      3374         0  100257.3         0  1.841935
 growthinv18 |  36.56511  146.5069  537.3532      3374         0   22282.6  1.816288  326.6357
    numdeals |         3  7.232662  17.32508      3374         0       275         0        16
 numstartups |        17  42.33225  88.08184      3374         6      1317         7        98

Another way to compare the methods is to look at the layers they select. This is visible in a box plot as well as summary statistics. The following is from ResultsV3.xlsx

Separate Samples (All Available City-Years)

Method N Avg Layer Index Std. Dev Layer Index
Maximum R2 3080 39% 18%
Startups in Clusters Inflection 6,743 35% 16%
Variance Explained Elbow 4,799 43% 30%

Using Common City-Years

Method N Avg Layer Index Std. Dev Layer Index L Index < Peak L Index < Max R2 X < Max R2
Maximum R2 2662 40% 24% 0 2662
Startups in Clusters Inflection 2662 44% 5% 1102 167 6%
Variance Explained Elbow 2662 31% 22% 53 297 11%

Finally, we can look at a city where different methods select different layers and look at those layers. Here's Cincinnati, Ohio, in 2018:

Implementing the Real Elbow Method

I calculated the between and within-cluster variances, as described below, using the Euclidean distance by using the ST_Distance function on PostGIS geographies (i.e., accounting for an ellipsoid earth using reference system WGS1984).

The output of the python HCL clustering script has around 40m observations (place-statecode, year, layer, cluster, startup), and some of the intermediate tables took several minutes to build. As the process should be O(n), this process could accommodate input data that is perhaps 100x bigger, assuming a patient researcher, which would imply source data perhaps 10x bigger. Note that the hardware/software that we are running this on is pretty close to the (current) frontier.

Fixing an issue

The within-cluster variance (and so F-stat and variance explained) revealed an issue with the data that had to be fixed: The Python HCA script forces the decomposition of multitons into singletons at the end of its run! We want to stop the HCA when we have every location in a separate point, rather than artificially forcing startups with the same location into separate points. This issue likely directly affects the heuristic method(s) that rely on layer indices and indirectly (by changing observation counts) affects the maximum r2 layer choice.

I pushed through the change and reran everything. It is build version 3.1, and includes a new .do file, new .txt data files, and a new .log file.

The new elbow layer is: 2.5795 x^3 - 3.7445 x^2 + 0.1989 x + 0.9808≈0.492554 at x≈0.483879 [2].


The results in the section below are outdated! The updated results are similar but not the same. Do not rely on the results on the wiki page.

Always check the log file in the dropbox (or in E:\projects\agglomeration) for the latest results!
Trying to find the elbow

The objective is to apply the Elbow Method, which involves finding the Knee of the curve of either the F-statistic or variance explained.

I used distances calculated by ST_Distance and calculated the variance explained using the following equations:

[math]SS_{exp}=\sum_{i=1}^{K} n_i(\bar{Y}_{i\cdot} - \bar{Y})^2[/math]
[math]SS_{unexp}=\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} \left( Y_{ij}-\bar{Y}_{i\cdot} \right)^2[/math]
[math]R^2 = \frac{SS_{exp}}{SS_{exp}+SS_{unexp}}[/math]

I then calculated forward differences, and added one to the answer, as using central differences left truncates the data. (An inspection of the data revealed that it is vastly more likely that the 'correct' answer is found at the left end of the data than the right. Also central first difference bridge the observation, which can lead to misidentification of monotonicity.) Specifically, I used:

[math] f'(x) = f(x + 1) - f(x) [/math]
[math] f''(x) = f(x+2) - 2 f(x+1) + f(x)[/math]

See https://en.wikipedia.org/wiki/Finite_difference

I required that a city-year had more than two layers, as it takes at least 3 layers to form an elbow. I then used [math]f'(x)[/math] to determine the layer index from which the variance explained was monotonic (i.e., there was no change in sign in [math]f'(x)[/math] in higher layer indices). This wasn't an issue when using the population variance explained. In an earlier version, when we used the sample variance explained, we had some non-monotonic sections of the curve resulting from integer division ([math]\frac{k-1}{n-k}[/math]).

I used [math]f''(x)[/math] to find the layer index [math]i[/math] at which [math]varexp_i = min(varexp)[/math] (for elbowlayer) or for which [math]varexp_i = max(varexp)[/math] (for elbowmaxlayer), for some city-year. I then marked [math]i+1[/math] as the elbow (or elbowmax) layer for that city-year, as we are using forward differences, not central differences. Note that the biggest change in slope could be found using max(abs(f(x))) but this is essentially always min(f(x)), i.e., the elbow layer, as the change in slopes are mostly negative. However, the changes in slopes do often go positive, and the elbowmax layer captures the biggest positive change in slope.

I created a new build (version 3.3) of the dataset, do file and log file, which includes the population variance explained elbow method, as well as the elbowmax method. It's in the dropbox..

Note that the lens found by the population elbow method is slightly bigger than the lenses found using sample elbow method from before, but the lens found using the elbowmax method is about the same size as the sample elbow method, if not slightly smaller. I'm not sure about the justification of the elboxmax method though.

Fixing the layer index

I checked the implementation of the % layer index again, and fixed a mistake in it.

In Heurcalc, I define fracunclusted as:

 CASE WHEN finallayer >1 THEN ((layer::real-1)/(finallayer::real-1)) ELSE NULL END AS fracunclustered,

I then previously select the layer below or equal to % layer index 0.374833 in Heurflh:

 CASE WHEN floor((finallayer-1)*0.374833)=0 THEN 1::int ELSE floor((finallayer-1)*0.374833) END AS heurflhlayer

This was incorrect. With some rearrangement, we can see: [math]index \le 0.374833 \implies \frac{layer-1}{final-1} \le 0.374833 \implies layer \le 0.374833(final-1) + 1 [/math]

I therefore fixed the query to:

floor(((finallayer-1)*0.374833)+1) AS heurflhlayer

I've updated the results in the New Do File section, and the two methods are now much closer to giving the same lens (see the descriptives!).

New Do File

There's a new do file, dataset and log file in the dropbox. The do file is reorganized and condensed into a single file. In order to select layers for specific purposes, like regressions, the do file uses flags (defined on lines 165-190). Do not try to run regressions using xt commands even with these flags, as the underlying layers used will be incorrect. Instead, put in the fixed effects yourself (using i.var) and put them at the front of the regression. STATA decides which variables to omit before doing the linear algebra, so you'll be able to get non-zero coefficients on omitted variables if you put them ahead of the fixed effects that would wipe them out.

The actual regressions I chose are on lines 243-250. There have the log of next period growth VC per startup as the dependent variable. Each spec uses year and city fixed effects, and clusters the standard errors at the city level. However, only the second spec in each sequence uses explicit scale controls.

Here's extracts of the results for Max R2:

reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml if regmaxr2==1, cluster(placeid)

         nohull |          0  (omitted)
  tothullcountl |  -.0731116   .0170361    -4.29   0.000    -.1067484   -.0394748
   tothullareal |   .0166992   .0032354     5.16   0.000     .0103111    .0230873
tothulldensityl |  -.0026603   .0034181    -0.78   0.438    -.0094092    .0040885
 avghulldisthml |   .0025771   .0060088     0.43   0.669    -.0092869    .0144411

reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml growthinv18l numstartupsl numdealsl if regmaxr2==1, cluster(placeid)

         nohull |   .0045458   .0007755     5.86   0.000     .0030165    .0060752
  tothullcountl |   .0042399   .0087493     0.48   0.629    -.0130145    .0214943
   tothullareal |   .0045121   .0042544     1.06   0.290    -.0038778    .0129021
tothulldensityl |   .0163778   .0048519     3.38   0.001     .0068095    .0259462
 avghulldisthml |  -.0024717   .0026637    -0.93   0.355    -.0077247    .0027813
   growthinv18l |   .0075835   .0025118     3.02   0.003     .0026301     .012537
   numstartupsl |  -.1492758    .016447    -9.08   0.000    -.1817106   -.1168411
     numdealsl |   .0121621    .005014     2.43   0.016      .002274    .0220501

Here's extracts of the results for the 1st Heuristic (Percentage of startups in clusters):

reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml if regheur1==1, cluster(placeid)

         nohull |   .0029323   .0007413     3.96   0.000     .0014704    .0043941
  tothullcountl |  -.0526231   .0113923    -4.62   0.000    -.0750896   -.0301567
   tothullareal |   .0155028   .0041201     3.76   0.000     .0073776     .023628
tothulldensityl |   .0057162   .0040038     1.43   0.155    -.0021797    .0136121
 avghulldisthml |  -.0022889   .0029852    -0.77   0.444    -.0081759     .003598

reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml growthinv18l numstartupsl numdealsl if regheur1==1, cluster(placeid)

         nohull |   .0047632   .0007554     6.31   0.000     .0032736    .0062528
  tothullcountl |   .0008837    .012177     0.07   0.942    -.0231303    .0248977
   tothullareal |   .0011863   .0042324     0.28   0.780    -.0071603    .0095329
tothulldensityl |   .0067998   .0043843     1.55   0.123    -.0018464     .015446
 avghulldisthml |  -.0063252   .0028367    -2.23   0.027    -.0119194   -.0007311
   growthinv18l |   .0076041   .0024995     3.04   0.003     .0026749    .0125333
   numstartupsl |  -.1450849   .0158493    -9.15   0.000     -.176341   -.1138289
      numdealsl |   .0124877   .0050118     2.49   0.014      .002604    .0223714

Lines 252 to 255 of the do file also compare the lense given by the max r2 method and the 1st heuristic method. The results are below. Note that the second spec shows the 1st heuristic for all city-years and the third spec shows it for the city-years that have a max R2 layer, to make a fairer comparison.

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50 mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  3.862758  7.279538      2951         1        67         1         7
tothullcount |         8  19.18841  36.62401      2951         3       369         3        37
 tothullarea |   15.2049  431.5689  2067.431      2951  4.04e-06  34780.04  .7658609  606.8908
tothullden~y |  .8115587  258.0355  13679.93      2951  .0002282  743141.7  .0123656  12.06713
 growthinv18 |  33.16967  141.5326  563.4799      2951         0   22282.6  1.412436  298.6927
    numdeals |         3  6.630973  17.17611      2951         0       275         0        14
 numstartups |        16   41.2528  90.67289      2951         6      1317         7        87
. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1, stats(p50 mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  4.352647  8.107042      3797         0       108         1         9
tothullcount |         9  20.60706  40.26376      3797         0       566         4        42
 tothullarea |  18.88252   110.757  348.9474      3797         0  6079.538  1.523386  236.2767
tothullden~y |  .5860988  6.013232  43.74094      3797         0  1429.742  .0277004  7.329074
 growthinv18 |   31.8453  133.0608  508.1196      3797         0   22282.6  1.235763  292.4397
    numdeals |         2  6.629181  16.46614      3797         0       275         0        15
 numstartups |        15  38.74743   83.6814      3797         6      1317         7        83

. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1 & compok==1 , stats(p50 mean sd N min max p10 p90) columns(statistics)

    variable |       p50      mean        sd         N       min       max       p10       p90
      nohull |         2  4.581159  8.548611      2951         0       108         1         9
tothullcount |         9  21.68519  42.78684      2951         0       566         4        42
 tothullarea |  19.43593  114.4633  360.3934      2951         0  6079.538   1.86059   252.267
tothullden~y |  .5797117  6.835335  49.19501      2951         0  1429.742  .0314084  7.848711
 growthinv18 |  33.16967  141.5326  563.4799      2951         0   22282.6  1.412436  298.6927
    numdeals |         3  6.630973  17.17611      2951         0       275         0        14
 numstartups |        16   41.2528  90.67289      2951         6      1317         7        87

A list of items


  1. Rerun the code with just the 8 geometry variables, and three factors, in the maximum R2 selection
  2. Fix the implementation of %unclustered (now called %complete)
  3. Compare the changed definitions and try some test regressions
  4. Write a paragraph justifying the 'heuristic' method.
  5. Write a couple of sentences about Guzman and Stern
  6. Explore if/how we could implement the variance-based elbow method

I also wanted to fix confusion between CSAs (Combined Statistical Areas)[3] and CMSAs (Consolidated Metropolitan Statistical Areas)[4]. CMSA redirects to CSA on Wikipedia. However, it is actually not clear if these are the same things. OMB is the originator of both terms[5].

New implementations

We're making two small but important changes to the implementations of:

  1. The maximum R2 method (now just 8 metadata vars)
  2. %Complete - now i-1/I-1
Max R2

I ran the code and pushed the results through the database. The newly selected layers are generally the same or one lower than the previously selected layers. I expect that this will be just fine.

Fraction Complete

Fraction complete isn't a variable in MasterLayersv3-0. It is used on the fly in queries to assess the elbow layer and then generated anew in AnalysisVcdb4.do (though it isn't really used there):

gen fracunclustered=layer/finallayer

I fixed the following, and reran the analysis using fracunclustered = (layer-1)/(finallayer-1):

  • elbowcalc->heurcalc
  • elbowdata->heurdata
  • Elbowflh-heurflh

Plotting the data from heurdata.txt in excel gives:

2.5944 x^3 - 2.9174 x^2 - 0.6407 x + 1.0304

Wolfram alpha[6] says the inflection point is now:

x≈0.374833, y≈0.516982

Note that this is now the fraction complete is relative to finallayer-1. For reference, the old value was 0.429388 relative to finallayer.

Implementing The Elbow Method

This section explores whether we could implement the actual elbow method (see https://en.wikipedia.org/wiki/Elbow_method_(clustering) ). The answer is that we might be able to, at least for some sub-sample of our data, but that it likely doesn't give us what we want.


The elbow method plots the number of clusters (on x) against the percentage of variance explained (on y) and finds the elbow. The elbow is the point at which the "diminishing returns [in variance explained] are no longer worth the additional cost [of adding another cluster]'. For the variance explained there are two main options:

  1. Variance explained = between-group variance / total variance
  2. Variance explained = between-group variance / within-group variance (Note that this is the ANOVA F-statistic).

Using the Law of total variance, total variance = between-group variance + within-group variance.

From Wikipedia:

The F-test in one-way analysis of variance is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other.

The "explained variance", or "between-group variability" is

[math] \sum_{i=1}^{K} n_i(\bar{Y}_{i\cdot} - \bar{Y})^2/(K-1) [/math]

where [math]\bar{Y}_{i\cdot}[/math] denotes the sample mean in the i-th group, [math]n_i[/math] is the number of observations in the i-th group,[math]\bar{Y}[/math] denotes the overall mean of the data, and [math]K[/math] denotes the number of groups.

The "unexplained variance", or "within-group variability" is

[math] \sum_{i=1}^{K}\sum_{j=1}^{n_{i}} \left( Y_{ij}-\bar{Y}_{i\cdot} \right)^2/(N-K), [/math]

where [math]Y_{ij}[/math] is the jth observation in the ith out of [math]K[/math] groups and [math]N[/math] is the overall sample size. This F-statistic follows the F-distribution with degrees of freedom [math]d_1=K-1[/math] and [math]d_2=N-K[/math] under the null hypothesis. The statistic will be large if the between-group variability is large relative to the within-group variability, which is unlikely to happen if the population means of the groups all have the same value.

Also from Wikipedia:

Variance is the expectation of the squared deviation of a random variable from its mean.

[math] \operatorname{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 = \left( \frac{1}{n} \sum_{i=1}^n x_i^2 \right) - \mu^2, [/math]

where [math]\mu[/math] is the average value. That is,

[math]\mu = \frac{1}{n}\sum_{i=1}^n x_i[/math]
Practical Consequences

It is possible that any calculation of variance using the full sample of our data (layers x city-years) is computationally infeasible. It seems particularly unlikely that we are going to manage between-group variance. I had problems in the past calculating mean distances between all centroids for just hulls, let alone for all geometries! We could, however, do this for some meaningful sub-population.

There is also the question as to whether this approach is sensible in our context. In its native form, we'd be selecting the number of statistical clusters. We could readily use it to select the number of hulls (economic clusters) instead. But, in either case, we'd have to be within-city for this to make sense.

We could normalize the number of clusters, dividing it by the maximum, to deal with the 'cities are different' problem. That is, we could put %unclustered (later called %complete) on the x-axis and %variance explained on the y-axis and fit a curve to a plot of city-year-layers. We could then pick a %unclustered value and apply it across cities. The difference between this and the 'heuristic method' is that we'd be choosing based on diminishing marginal returns in variance explained as opposed to in percentage locations in hulls.


  1. We could do the elbow method on a per city-year basis. The number of statistical clusters is equal to the number of layers, so we'd be indexing over layers, and selecting a layer, for each city-year. It might be worth trying this for some city-year, say Tulsa, 2003. The code would be reusable for a bigger sample. Estimate: 3hrs.
  2. I've rechecked the code and I now think it is computationally feasible. What I was trying to do before was find the average distance between every set of coordinates, which is an order more complex than what we need to do to calculate even within-group variance (between-group variance is simpler). Think O(n) rather than O(n^2), and we have around ~20million statistical clusters spread over ~200k layers. Estimate, given (1) above: 2hrs.
The "Right" Implementation

I could implement the elbow method using scalar distances. Specifically. I could measure [math]\bar{Y}_{i\cdot} - \bar{Y}[/math] as the distance between a cluster's centroid and the overall centroid, and measure [math]Y_{ij}-\bar{Y}_{i\cdot}[/math] as the distance between a cluster's constituent location and its centroid. There's a sense in which this approach is "the right thing to do", and the distance measurements are pretty straight-forward in PostGIS (and would account for an elipsoid earth).

However, in actuality, we have vectors of locations [math](a,b)[/math], and not scalar distances as fundamental inputs. This changes the math[7], as well as the ultimate test statistic [8] that we might use.

Specifically, [math]Y_{ij}[/math] is a vector: [math]\mathbf{Y_{ij}} = \begin{pmatrix} Y_{ija} \\ Y_{ijb}\end{pmatrix}[/math], as are the sample mean [math]\mathbf{Y_{i\cdot}} = \begin{pmatrix} \bar{Y}_{i\cdot a}\\ \bar{Y}_{i\cdot b} \end{pmatrix}[/math] and the grand mean [math]\mathbf{Y} = \begin{pmatrix} \bar{Y}_{a}\\ \bar{Y}_{b} \end{pmatrix}[/math].

So the between-variance, which is called the Hypothesis sum of squares and cross products, is a 2x2 matrix and has K-1 degrees of freedom:

[math] H=\sum_{i=1}^{K} n_i(\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}})(\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}})' [/math]

And the within-variacnce, called the Error sum of squares and cross products, is a 2x2 matrix and has N-K degrees of freedom:

[math] E=\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} ( \mathbf{Y_{ij}}-\mathbf{\bar{Y}_{i\cdot}} )( \mathbf{Y_{ij}}-\mathbf{\bar{Y}_{i\cdot}} )' [/math]

The T sum of squares and cross products is [math]\mathbf{T=H+E}[/math].

This is all potentially problematic because relational databases don't naturally do linear algebra very well. And we likely don't want to do this in some other software for two reasons. First, we'd have to move the data out and back into the database, which is costly. And, second the other software isn't likely to handle an elipsoid earth and the coordinates are not actually on a flat plane. (Though, assuming a flat earth might be reasonable as most cities are sufficiently small that curvature won't materially affect results, no matter how conceptually offensive it is.)

We could calculate the elements of H and E separately in PostGIS, taking advantage of ST_Distance. That wouldn't be too bad as the matrices are only 2x2s. However, it's worth asking whether that would be "right".

Let's look at how H captures information about 'distance' (ignoring flat earth issues). Denote [math]\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}} = \begin{pmatrix} \Delta A \\ \Delta B\end{pmatrix}[/math]. Then:

[math] \mathbf{H}=\sum_{i=1}^{K} n_i \begin{pmatrix} \Delta A^2 & \Delta A \Delta B \\ \Delta A \Delta B & \Delta B^2\end{pmatrix} [/math]

So, the determinant and trace of H are:

[math] \det(\mathbf{H}) = \Delta A^2 \Delta B^2 - (\Delta A \Delta B)^2 \quad \mbox{and} \quad Tr(\mathbf{H}) = \Delta A^2 + \Delta B^2 [/math]

Thus the trace is the square of the distance between points (for H, between a point within a cluster and the cluster's centroid). The trace of a product is the product of the traces only under specific circumstances and not in general, though we likely meet those circumstances (no zero elements).

Standard Test Statistics

We'd like to use some measure of variance explained, but variance is now a matrix. The standard test statistics for MANOVA, which are the closest equivalents to the ANOVA F-statistic from earlier, are:

Wilk's Lambda: [math]\Lambda^* = \frac{\det \mathbf{E}}{\det(\mathbf{H}+\mathbf{E})}[/math]

Hotelling-Lawley Trace: [math]T = Tr(HE^{-1})[/math]

Pillai Trace: [math]V = Tr(H(H+E)^{-1})[/math]

So... the two trace statistics are very close to what we would get if we used scalar distances and used either scalar definition of variance explained. The main difference is the lack of correction for degrees of freedom.

An Opportunity?

We can't find a decent, let alone seminal, reference for using the elbow method to select the number of clusters. Our problem, which uses geographic coordinates, is also a special case anyway. So, we could implement a method using scalar distance and put a description of it, and its relationship to other measures, in the appendix. It might be a good value-added for the paper.

One final thought: We could weight the distance between locations and the mean(s) by the fraction of startups in the location.

The Heuristic Method Justification

An attempt at a paragraph justifying the 'heuristic' method:

Our heuristic method provides an objective technique for picking a city-year layer, which identifies and maps its clusters. It uses two measures. First, we are interested in clusters rather than lines or points, so we measure the percentage of locations in clusters. Second, we want to view each city-year through the same lens. As layer indices are not comparably across city-years, we use the HCA's 'fraction complete' to measure a layer's lens.
As an HCA progresses towards completion, it takes locations out of clusters at an increasing and then decreasing rate. Accordingly, a city-year plot, with the fraction complete on the x-axis and the percentage of locations in clusters on the y-axis, gives an S-curve. The inflection point of this curve marks a conceptual transition between refining clusters and dismantling them.

Note: For each layer of the HCA from i=1 to i=I, we define the HCA's fraction complete as (i-1)/(I-1). The HCA's fraction complete is then zero for the first layer when all locations are in a single hull, and one for the last layer when it has decomposed every cluster into separate locations.

From version 8:

My ‘elbow method’ fits a cubic function to the relationship between the percentage unclustered and the percentage of locations that are in hulls and determines the inflection point. The inflection point finds the layer beyond which further unclustering moves locations out of hulls at a decreasing rate.
As a rough guide, divisive clustering of geographic data of things like startups in cities moves through three stages. In the first stage, the algorithm identifies outliers as points, lines, or small-population hulls. Highest-first layers with low hull counts occur in this stage. In the second stage, the algorithm breaks apart core areas until it achieves the maximum number of hulls. Then, in the third stage, these hulls are refined, providing a progressively tighter lens on the core groupings, and dismantled, until all that remains are points. The elbow method identifies layers in this third stage at the tipping point between refining hulls and dismantling them.

Guzman and Stern

The objective:

"I think less is more. We just want to try to immunize ourselves against a referee who thinks we might be unaware of that data. We need a sentence saying it exists and another sentence saying why we don’t use it. And citing one source is probably enough."

A hard-and-fast two-sentence version, with a one-sentence pillow:

Guzman and Stern (2015) use business registration to find non-venture-backed startups. We do not use this data as it does not have a performance measure to demonstrate the representative layer's selection. However, understanding the relationships between clusters of different types of startups is an exciting topic for future research.

A paragraph that covers the reviewer's comments too:

We use data on venture capital-backed startups as their venture capital investment provides a direct performance measure to identify their clusters. However, other entrepreneurial firms may have clusters that influence clusters of venture-backed firms. In particular, pre-venture capital startups, firms that have experienced an acquisition or an initial public offering, and high-growth startups that do not raise venture capital (see Guzman and Stern, 2015, and related papers) may all affect the agglomeration of venture-backed startups. We leave it to future research to examine the relationships between clusters of different types of startups.

A two-sentence version:

We use data on venture capital-backed startups as their venture capital investment provides a direct performance measure to identify their clusters. However, we expect that other entrepreneurial firms, such as those identified from business registration data in Guzman and Stern (2015), may have a second-order effect on the clustering of venture-backed firms.

We should cite their Science paper:

Guzman, Jorge and Stern, Scott (2015), "Where is Silicon Valley?", Science, Vol. 347, No.6222, pp. 606-609, American Association for the Advancement of Science


@article {Guzman606,
	author = {Guzman, Jorge and Stern, Scott},
	title = {Where is Silicon Valley?},
	volume = {347},
	number = {6222},
	pages = {606--609},
	year = {2015},
	doi = {10.1126/science.aaa0201},
	publisher = {American Association for the Advancement of Science},
	issn = {0036-8075},
	URL = {https://science.sciencemag.org/content/347/6222/606},
	eprint = {https://science.sciencemag.org/content/347/6222/606.full.pdf},
	journal = {Science}
Previous attempts to include Guzman and Stern

From version 8: Finally, Guzman and Stern (2015a,b) and related papers use business registration data to suggest that non-venture high-growth, high-technology startups outnumber their venturebacked counterparts by around three or four to one. These papers map and analyze nonventure startup activity within states, cities, zip codes, and for an individual street. A natural extension to this paper would consider agglomeration economies of non-venture startups, as well as their effects on the agglomeration of venture-backed firms.

From version 4.1: Catalini et al. (2019) suggest that there are high-growth, high-technology startups that are not supported by venture capital. These firms may have been turned down by venture capitalists, may never approach venture capitalists, or may not have had their venture investment recorded at the time the authors conducted their study. Regardless, according to Catalini et al. (2019), non-venture HGHT startups outnumber venture-backed firms by around three or four to one. One obvious question is then whether they enhance or detract from agglomeration economies among venture-backed startups.

A natural extension of this paper would consider the impact of these firms. Catalini et al. (2019) derives its data from the processing of business registration data, which is explored for Massachusetts in Guzman & Stern (2015a), California in Guzman & Stern (2015b), firms leaving Delaware in Guzman (2017), and 15 U.S. states in Guzman & Stern (2016). This data provides full addresses, and so readily lends itself to Geographic Information System mapping, as well as a hierarchical cluster analysis decomposition to create microgeographies.

Comments from the Associate Editor and two reviewers at Management Science

Associate Editor: Multiple referees recommend supplementing your analyses with additional datasets. An expensive but informative path would be the National Establishment Time Series, which has street addresses for every business that reports data to Dun & Bradstreet; this would solve your longitudinal address problem. Alternatively, at no cost you can take advantage of Guzman & Stern’s Startup Cartography project data, which provides counts of the high-potential startups you are probably trying to capture with VX (and also see high-potential startups that don’t raise VC).

Reviewer 2: First, the analysis only considers startup ventures which have already received venture funding—an important component of the ecosystem, for sure, but not necessarily the most important for agglomeration purposes. This results in an incomplete and possibly distorted picture of what agglomeration actually means. if a district has 50 startups that have not yet received VC funding and one that has, would that qualify as agglomeration? And what about companies that have been acquired or have gone public? At a minimum, a more accurate count of active ventures would be required to draw any legitimate conclusions about the effects of agglomeration, along the lines of what Andrews, Fazio, Guzman, Liu, & Stern (2019) have done with the Startup Cartography Project, especially if it is true that “non-venture high-growth, high-technology startups outnumber their venture-backed counterparts by around three or four to one” (p. 5).

Reviewer 4: The authors note the challenges of using VC database addresses. This needs to be more closely considered given the persistence/agglomeration application currently contemplated. If ala Guzman’s work, you have a venture being ported to Redwood City for a big financing round and you also assign all prior rounds to Redwood City, you create quite a biased sample to extra growth around winner places.

Back and Forth

Big things:

  • Q: What is this technique for? A: Endogenous selection!
  • Q: Why cities? A: They are just foundational geographies. We might be able to do this starting with the entire US for a year.
  • Big problems with last version:
    • Ed's choice of language, especially using GIS terms.
    • Lack of focus. The paper does too many things!
  • Get Jim set up on the infrastructure: Wiki account, RDP account, etc.

Other notes:

  • Ed to fix the % unclustered definition.
  • Send Jim Wald method justification

Proposed title:

  1. A new method for estimating the pattern and efficiency of agglomeration with an application to venture-backed startups.
  2. A new method for identifying and mapping spatial agglomeration with an application to venture-backed startups

Note: We had discussions on mapping vs. delineating, startups vs. ecosystem, and other tradeoffs.

4 or 5 value-added points by Ed:

  • Transform the problem of how to find and delineate agglomerations into the problem of selecting the right lens with which to view agglomerations
  • Endogenously determine the location and boundaries of agglomerations
    • Examine agglomeration at any scale, including microgeographic scales (i.e., 10s or 100s of meters). This is important as we think that some sectors, like high-growth high-tech startups, have microgeographic agglomerations.
    • Produce maps of agglomerations so we can examine their characteristics (socio-economics, demographics, physical, etc.) and facilities (e.g., parks, coffee shops, etc.) of agglomerations without including irrelevant areas. The results have policy implications.
  • Allow estimation of continuous changes in spatial characteristics on agglomeration economies.
    • Determine the equilibrium, where costs (from increased competition over scarce resources) and benefits (from reduced transportation costs) are in balance, and understand its policy implications.
    • Conduct policy simulations by demarking some area as an innovation district and exploring the effects of relocating startups into it.
  • In the context of venture-backed startups, agglomeration economies are as powerful as scale effects but easier and cheaper to achieve with policy.

The revision will be built around a public-policy hook, which pervades these points.

Questions and Answers

  • What's the method: Ward's method in reverse. Always add another cluster!
  • How do we select the best layer for our purpose:
    • Our purpose is to identify/delineate agglomeratons within a CDP
    • We use lowest-highest layers, which correspond to unique hull counts with 'stable' hulls, in a regression at the CDP level, with a performance measure (next year's venture investment) as the dependent variable.

Suggestions and Questions

Terminology - Match to audience but accurate.

  • Propose: Hulls -> Clusters. Issues: mixes statistical def, also what about boundary? Alternative: Area.
  • Census place -> cities.
  • Lowest-highest -> representative
  • Understand regression to select hull count
  • Heat map!

References for things we used:

To do the HCA we used the AgglomerativeClustering method from the sklearn.cluster library (version 0.20.1) in python 3.7.1, with Ward linkage and connectivity set to none. This method is documented here: https://scikit-learn.org/stable/modules/clustering.html. I checked some of the early results against an implementation of Ward's method using the agnes function, available through the cluster package, in R. https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/agnes

The data was assembled and processed in a Postgresql (version 10) database using PostGIS (version 2.4). We used World Geodetic System revision 84, known as WGS1984 (see https://en.wikipedia.org/wiki/World_Geodetic_System), as a coordinate system with an ellipsoidal earth, to calculate distances and areas (see https://postgis.net/docs/manual-2.4/using_postgis_dbmanagement.html). Shapefiles for Census Places were retrieved from the U.S. Census TIGER (Topologically Integrated Geographic Encoding and Referencing) database (see https://www.census.gov/programs-surveys/geography.html).

The statistical analysis was done in STATA/MP version 15.

New Target Journals

There is a section of Wikipedia page on "New Economic Geography" that is worth reading, though it seems very out of context...

Impact measured using H-index

Journals (Ed's prefs in bold):

Also, we might want to pull a Scott Stern and try:

  • Science -- H-Index: 1124 [20]; Times: 2.0 months to 1st round and 5.3 months to completion based on 54 reviews[21]; Homepage: https://science.sciencemag.org/; Other: up to ~4500 words. Can have additional supplementary materials.

Changes to methodology

We changed the PCA to use just 8 variables and we then took just 3 components:

pca nosinglemulti nopair nohull totsinglemulticount totpaircount tothullcount tothullarea totpairlength
Principal components/correlation                 Number of obs    =    216,242
                                                Number of comp.  =          8
                                                Trace            =          8
   Rotation: (unrotated = principal)            Rho              =     1.0000
      Component |   Eigenvalue   Difference         Proportion   Cumulative
          Comp1 |      3.22169      1.09811             0.4027       0.4027
          Comp2 |      2.12359      1.11652             0.2654       0.6682
          Comp3 |      1.00707      .147009             0.1259       0.7940
          Comp4 |      .860061      .199002             0.1075       0.9016
          Comp5 |      .661058      .550229             0.0826       0.9842
          Comp6 |      .110829      .101626             0.0139       0.9980
          Comp7 |    .00920377    .00270352             0.0012       0.9992
          Comp8 |    .00650025            .             0.0008       1.0000
Principal components (eigenvectors) 
       Variable |    Comp1     Comp2     Comp3     Comp4     Comp5     Comp6     Comp7     Comp8 | Unexplained 
   nosinglemu~i |   0.1763    0.6213    0.1362   -0.2153    0.1628   -0.0441    0.4473    0.5380 |           0 
         nopair |   0.4968    0.0588   -0.1088    0.4258   -0.1916    0.0992    0.5652   -0.4366 |           0 
         nohull |   0.4613   -0.2494    0.1004   -0.3290   -0.1480   -0.7638   -0.0352   -0.0087 |           0 
   totsinglem~t |   0.1948    0.6196    0.1305   -0.1813    0.1416    0.0103   -0.4529   -0.5504 |           0 
   totpaircount |   0.4932    0.0711   -0.0971    0.4262   -0.2260    0.1408   -0.5224    0.4656 |           0 
   tothullcount |   0.3642   -0.2913    0.2530   -0.5627   -0.1425    0.6171    0.0259   -0.0001 |           0 
    tothullarea |  -0.0543   -0.0905    0.9248    0.3532    0.0797   -0.0494    0.0035   -0.0016 |           0 
   totpairlen~h |   0.3041   -0.2575   -0.1175    0.0902    0.9039    0.0391   -0.0224    0.0119 |           0 

From the eigenvectors, we can see that (give or take):

  • Component 1 is driven by the number of lines and hulls, as well as the number of startups in lines and hulls
  • Component 2 is driven by the number of points and the number of startups in points
  • Component 3 is driven by the total hull area

Previous Version

Target Journal

High-end options might include:


The paper was submitted to Management Science under Toby Stuart on the 26th March 2020. Toby immediately requested a change to the framing and the paper was resubmitted on April 6th. It was then sent out for review. I think I drew Josh Gans as an A.E. The paper was rejected on June 25th with the following note from Toby:

I have received four reviews and an Associate Editor report on your paper. I am very sorry to inform you that three of the four reviewers and the AE recommend against proceeding with the manuscript. All readers find this to be a really thought-provoking piece of work and I'd say there is a consensus that it falls into the "valiant try" category. it presents a novel idea and it got everyone thinking, but in the end, it is also is too much of a kluge and too difficult to follow to proceed with a revision, given the reviewers' doubts that the project is one round of revision away from a successful outcome.


New files are in E:\projects\agglomeration

  • agglomeration.sql -- new main sql file for vcdb3
  • AgglomerationVcdb4.sql -- even newer main sql file for vcdb4

Contents of the Old Code subdirectory:

  • Agglomeration.SQL -- This was the replacement code that was run when vcdb3 was made. We had to work through it to update the counts. Came from E:\projects\vcdb3\OriginalSQL\Agglomeration.SQL
  • Agglomeration.sql -- Old version that makes colevel data for python script. Came from E:\mcnair\Projects\Agglomeration
  • Analysis.sql -- Old version builds from HCL forward. Came from E:\mcnair\Projects\Agglomeration

See also:


The Vcdb4 rebuild is done with the script AgglomerationVcdb4.sql in E:\projects\agglomeration

It needs:

  • portcomaster -- built in new BuildBaseTables
  • tiger2 -- built in AgglomerationVcdb4.sql from TigerModified, which combines 7 encapsulated places with their encapsulators (see section below).
  • portcogeo -- built in load tables and modified there to create lat4 and long4, which have 4dp (11.1m) of precision, which wikipedia lists as a land parcel and roughly corresponds to an office front.
  • roundsummary ->placeyearvc -- Roundsummary doesn't exist anymore. We'll build placeyearvc from a different base
  • rankingfull -- Requires deciding how to deal with places (see below).
  • Cpi - done in Load.sql

When doing the synthetics for Houston, we need a central location and hull size. The 2007 hull centroid was at (long,lat=-95.47029505572422,29.745008286648368). The WeWork in The Galleria on Westheimer is at (-95.4765668, 29.7256031). We'll use a scaling hull size of 1 hectare per startup. For a 25 hectare district, we would do sqrt(25)/2 as the plus-minus in hm, where 1 hm is 0.0000898 degrees (at the equator for latitude). Or more broadly, the degree +/- for an x hectare district is:

+/- = ((sqrt(x)/2)*0.0000898)

Other notes:

  • This build also attempts to incorporate American Community Survey (ACS) Data.
  • The level code was taken out of the main SQL file and moved to ExcessCode.sql.
  • The elbow calculation was done using elbowdata, not elbowdatarestricted. The workings are in E:\projects\agglomeration\Elbow.xlsx. Plot fracunclustered (X) against avgfracinhull (Y), and fit a cubic to give: 2.8485 x^3 - 3.3892 x^2 - 0.43^x + 1.0373≈-0.0553263 at x≈0.425704 [24]
  • The Chosen Hull Layer can't be done until after the STATA analysis has been run, so for the first pass these tables are created empty. The code will have to be re-run from line 1200 onwards to add this in, recomputer the layer distances, and update the master dataset later.
  • The TIF data processing was taken out of the main file. It was moved to E:\projects\agglomeration\TIF\TIF.sql.
  • The old Houston code was largely redundant after the rebuild, but it was taken out to ExcessCode.sql, as was the diagnostic code that followed it.

Still to do:

  • ACS variables - Just needs a join!
  • Run and add in chosen layers!

Analysis and Exploration

Choose LHS variable:

  • Growth VC: Growthinv17lf -- Rather than seed, less shocky. Forward one period. In 2017 or 2018 dollars to make it real. Logged to normalize somewhat and give a 'change' interpretation
  • Growth in growth VC: gg17pcl -- Definitely logged (it's wildly non-normal without and somewhat normal with), even though this gives an change in growth rate interpretation. Could winsorize but logging pretty much deals with outliers anyway. Forward less present over present, so still forward looking.
  • We could also conceivably use the change in rank: rankchg (=f.overallrank-overallrank) or rankup (=overallrank-f.overallrank). It would be much 'smoother' than other measures. It's also pretty normal (if a little sharply peaked at 0). However, it would be relative performance measure!

Choose dataset:

  • Restrict on numlayers. Those with <3 layers can't form hulls. 3 layers is therefore a minimum for a hull based analysis. 6 layers allows 2 hulls, in theory anyway, and doesn't. 6,746 place-years have 2 or more layers (max is 1317), 4,969 have 6 or more layers.
  • Restrict on time period. Data is from 1980 to present, through note that we have our synthetic Houstons which have years from 1 to 65 to indicate how many locations were replaced. We might want to restrict to 1986 to present to get good coverage. Or from 1995 to present to do the 'modern era'. Note that 2019 is a half year, so we should probably through it out.
  • We are already restricting to the 198 places that had greater than 10 active at some point in their history...

Choose scale regressors:

  • growthinv17l numdealsl numstartupsl

Then the analysis will be loosely as follows:

  • Non-layer dependent descriptives
  • Explain layers using Burlington, VT?
  • Explain outliers using Las Vegas, NV?
  • Highest1Hull
  • Show Boston? at the elbow for 2018, 2017, 2016
  • Elbow
  • Layer Dependent descriptives I
  • Levels (picking layer by average hull size)
  • Maximum R2
  • Layer Dependent descriptives II
  • Group Means
  • Policy Simulation

It would be good to include some MSA level or broader ecosystem level variables. These could include:

  • Scale of MSA?
  • Boston-Cambridge, Research Triangle, Silicon Valley, North East Corridor, LA, Bay Area, Seattle Area, and other notable indicators.
  • Some dbase calculated measures:
    • Number of adjacent places in placetigerarea (or similar)? Create a 10km buffer and look for intersections?
    • Chain places together and count them?

The Between Estimator just isn't appropriate. We do want fixed effects (we'll test to see if random effects are appropriate - they aren't). See E:\projects\agglomeration\microeconometrics-using-stata.pdf p254 to p262, as well as this www.statalist.org thread.

Also the maximum r2 approach picked a number of hulls, which should have been constant (it is now - there were errors before) within a place, making it redundant. Note that STATA will run the variable if you include the fixed effect second but not first - try:

reg something else i.control
reg something i.control else

This problem doesn't occur when using xtreg, which also appears to correctly cluster the standard errors.


The 'highest1hull' (or 2 or 3) is a specification that identifies the highest level that has 1 hull the first time (i.e., starting from level 1, go until the level has 2 hulls and back it up one, or if it doesn't have 2 hulls, just find the highest layer with 1 hull.

As a consequence we can't use nohull, as it is always 1. It also turns out that tothullcount and tothulldensity don't matter, likely because we already include numstartupsl (removing it makes tothullcount significant). avgdisthm, however, does work.


We tried the following 'levels' (See Levels.xlsx):

Name Tgt. Avg. Hull Size Act. Avg. Hull Size Act. Avg. Hull Count Coef. Sig Act. Density
HCLLayerLevel1 2 1.390169 3.31986 0.031994 0.418743
HCLLayerLevel2 5 3.828996 4.213987 -0.00471 0.90864
HCLLayerLevel3 10 7.921853 4.949919 -0.04308 1.600401
HCLLayerLevel4 15 12.38832 5.483262 -0.00495 2.259297
HCLLayerLevel5 20 16.82407 5.879045 0.057642 2.861701
HCLLayerLevel6 25 21.07783 6.218925 0.13614 * 3.389304
HCLLayerLevel7 30 25.40073 6.512421 0.174672 ** 3.900351
HCLLayerLevel8 35 29.48548 6.788268 0.155578 ** 4.343594
HCLLayerLevel9 40 33.6133 7.02369 0.158089 * 4.785704
HCLLayerLevel10 50 41.93593 7.452783 0.163425 * 5.626882
HCLLayerLevel11 35000 5080.886 37.30932 0.3621 136.1828

From this, the optimum density -- the point at which the change in total density gives no further growth effect is probably between 2 and 3 startups per hectare. We will split the difference and use 2.5 startups per hectare as the optimum for the our innovation district simulations. Given than 0.001 decimal degrees is 111.3199m, there are 0.000898 degrees/hm [25]. We'll use square boxes, so our corners will be:

coord +/- ( ( (sqrt(x)/2)*0.000898) / 2.5)

Note that we had one too many decimal places in the earlier version by accident, which is why everything looked really weird!


The elbow uses the point of inflection between the fracunclustered and fracinhull in the base table hcllayerwfinal (i.e. across all place, statecode, year, layer). This inflection point is at 0.425704 fracunclustered -- which is layer/finallayer.

2.8485 x^3 - 3.3892 x^2 - 0.43^x + 1.0373≈-0.0553263 at x≈0.425704 (y=0.459799)

Within cities, and with year fixed effects, the only things that matter are the number of singleton/multitons and the number of hulls. The more singletons/multitons, the greater the growth; and the more hulls, the lower the growth. This result goes away if we control for the full set of geographic characteristics. Without city fixed effects, only avgdisthm is significant (and negative).

Maximum R2

The Maximum R-squared analysis relies on finding layers by hull count. Specifically, for a certain hull count, say 2, we find the lowest-highest layer occurrence. That is we first find the last layer (i.e., the highest layer) that had a hull count of 2, then if that occurrence is a part of a sequence, we find the first time that in the sequence.

Possible examples include Portland, OR (in say 2018), or Burlington, VT in 2015

SELECT * FROM HullsBase WHERE place='Burlington' AND statecode='VT' AND year=2015;
place	statecode	year	layer	numclusters
Burlington	VT	2015	1	1
Burlington	VT	2015	2	1
Burlington	VT	2015	3	1
Burlington	VT	2015	4	2
Burlington	VT	2015	5	2
Burlington	VT	2015	6	1

For Burlington, the lowest-highest layers are 4 for hull count 2 and 6 for hull count 1. Note that there is presumably a layer 7 with hull count 0.

We want to regress, within each city, our measures on performance, and then select the lowest-highest layer -- in effect the number of hulls -- which maximizes the R-squared of the regression. We are going to use 1995 to 2018 inclusive, which is 24 years. We want to include scale variables (i.e. 3 vars: growthinv17l numdealsl numstartupsl), and the (almost full -- we can't do avgdisthm) set of explanatory vars (8 vars nosinglemulti nopair nohull totsinglemulticount totpaircount tothullcount tothullarea totpairlength), which would total 11 vars plus an intercept. To reduce this, we can do a PCA first.

The alternative method is to create residuals from the outcome variable (growth vc) after taking out the variation explained by the scale effect. We can then use these residuals as the outcome variable in the city-by-city regressions (again, perhaps with a PCA). The advantage to this method is that we will pick layers based on their R2 of just the agglomeration effect, rather than agglomeration and scale together.

reg growthinv17lf growthinv17l numdealsl numstartupsl i.year i.placeid if lowesthighestflag==1 & year>=1995 & year <.
predict growthinv17lfres, residuals

Using the first method, the PCA yields 3 or maybe 4 components. We'll use 4.

Then we need to create a new dataset with placeid year as the panel, so that we can run between regressions. One issue was that some observations have a missing avgdisthm even when layer=chosenhulllayer. This was because in some cases there was only one geometry. In a similar vein, avghulldisthm is the distance between hull geometries, not within hull geometries, so is missing when there are zero or one hulls. In both cases, we flag the instances (using variables onegeom, onehull, zerohull) and then replace missings with zero (as this is conceptually correct).

Instrumental Variable(s)

See https://en.wikipedia.org/wiki/Instrumental_variables_estimation for background.

We want a variable that affects the amount of venture capital only through its effect on agglomeration. More specifically, in the case of the highest1hull, we want a variable that affects the next periods VC only through its effect on the hull size (given that there is always one and only one hull). Moreover, we will be using city-level fixed effects, so a variable that doesn't change over time -- like the size of the city -- will have no variation to drive an effect on hull size. There are two immediate possibilities, using a size measure involving the TIFs or using the number of locations within TIFs. The points measure has two opposing effects in it: it grows as more startups move into TIFs and it grows as TIFs take up more area. The size measure(s) could be the area of TIfs intersecting last periods hull and the area not intersecting it.

hull area in t = f( hull area in t-1 intersecting TIF areas in t, TIF areas in t not intersecting hull area in t-1).

The two measures will have temporal variation as TIF areas vary over time and as hull areas vary over time...

Instrumenting highest1hull

The instrument on highest1hull needs to work in itself, but we also need a consistent estimate of an effect for the population and for the 13 (or 10, etc.) cities that we can instrument before we start.

It looks like including year fixed effects in the 'standard' highest1hull spec is just too much with the reduced sample size. We can use a boom indicators though:

reg growthinv17lf growthinv17l numdealsl numstartupsl avghullarea boom i.placeid if reg1==1 & year95==1, robust
reg growthinv17lf growthinv17l numdealsl numstartupsl avghullarea boom i.placeid if reg1==1 & year95==1 & tifs==1, robust

reg avghullarea tifintareahm boom i.placeid if reg1==1 & year95==1 & tifs==1, robust

tab placeid if tifs==1, gen(tifcity)
ivreg2 growthinv17lf growthinv17l numdealsl numstartupsl boom tifcity* (avghullarea=avghullarea_L) if reg1==1 & year95==1 & tifs==1, robust endog(avghullarea)

Useful documentation:

And later additions:

Results are as follows:

  • In the first stage, the instrument is highly (positively statistically significant).
  • Without the instrument my estimate is small, negative and sig at 5%.
  • With the instrument, the estimate is small, positive and insig.
  • Underidentification test is significant, H0 is that model IS underidentified, so this is rejected p486
  • Weak ID test has a Cragg-Donald Wald F statistic of 58.713, which is well above the critical bounds
  • Overidentification test is 0 (null is identified), and it says "equation exactly identified"
  • Endogeneity test is 3.375 with Chi-sq(1) P-val = 0.0662. Null is that it is exogeneous, which is rejected.

. ivreg2 growthinv17lf growthinv17l numdealsl numstartupsl boom tifcity* (avghullarea = tifintareahm) if reg1==1 & year95==
> 1 & tifs==1, robust endog(avghullarea)
Warning - collinearities detected
Vars dropped:       tifcity12

IV (2SLS) estimation

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity

                                                      Number of obs =      179
                                                      F( 16,   162) =    14.82
                                                      Prob > F      =   0.0000
Total (centered) SS     =  412.5880393                Centered R2   =   0.5846
Total (uncentered) SS   =  3128.446788                Uncentered R2 =   0.9452
Residual SS             =  171.3895217                Root MSE      =    .9785

             |               Robust
growthin~7lf |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 avghullarea |   7.13e-06   8.44e-06     0.84   0.399    -9.42e-06    .0000237
growthinv17l |   .2036359   .1261974     1.61   0.107    -.0437064    .4509783
   numdealsl |  -.0775574   .1797158    -0.43   0.666    -.4297939    .2746791
numstartupsl |   .8311169   .2834661     2.93   0.003     .2755335      1.3867
        boom |   1.063169   .2712561     3.92   0.000      .531517    1.594821
    tifcity1 |   .3257855    .401005     0.81   0.417    -.4601698    1.111741
    tifcity2 |    .055352   .6979564     0.08   0.937    -1.312617    1.423321
    tifcity3 |  -.6024319   .7152221    -0.84   0.400    -2.004241    .7993776
    tifcity4 |   .3234907   .4568871     0.71   0.479    -.5719915    1.218973
    tifcity5 |  -.2383231   .4824386    -0.49   0.621    -1.183885    .7072392
    tifcity6 |  -.2941022    .398834    -0.74   0.461    -1.075802    .4875981
    tifcity7 |  -.8108876   1.113175    -0.73   0.466    -2.992671    1.370896
    tifcity8 |   .2262667   .5042398     0.45   0.654    -.7620251    1.214558
    tifcity9 |  -.0979371   .4359115    -0.22   0.822    -.9523079    .7564337
   tifcity10 |  -1.042832   .6154755    -1.69   0.090    -2.249142    .1634775
   tifcity11 |  -.3423386   .4180955    -0.82   0.413    -1.161791    .4771135
   tifcity12 |          0  (omitted)
       _cons |  -.0235353   .9222309    -0.03   0.980    -1.831075    1.784004
Underidentification test (Kleibergen-Paap rk LM statistic):             17.585
                                                   Chi-sq(1) P-val =    0.0000
Weak identification test (Cragg-Donald Wald F statistic):               58.713
                         (Kleibergen-Paap rk Wald F statistic):         15.579
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
-endog- option:
Endogeneity test of endogenous regressors:                               3.375
                                                   Chi-sq(1) P-val =    0.0662
Regressors tested:    avghullarea
Instrumented:         avghullarea
Included instruments: growthinv17l numdealsl numstartupsl boom tifcity1
                      tifcity2 tifcity3 tifcity4 tifcity5 tifcity6 tifcity7
                      tifcity8 tifcity9 tifcity10 tifcity11
Excluded instruments: tifintareahm
Dropped collinear:    tifcity12

Note that using lagged avghullarea as an instrument seems to work... Results are as follows:

  • With the instrument, the estimate is close to the original, still negative and sig at the 5% level.
  • Underidentification test is significant, H0 is that model IS underidentified, so this is rejected
  • Weak ID test has a Cragg-Donald Wald F statistic of 679.195, which is massively above the critical bounds
  • Overidentification test is 0 (null is identified), and it says "equation exactly identified"
  • Endogeneity test is 0.037 with Chi-sq(1) P-val = 0.8465. Null is that it is exogeneous, which we can't reject.

Overall, though, it looks like the TIF instrument is dead. Which is a HUGE shame.


It looks like a previous version of this results section wasn't saved somehow...

The crucial information is the magnitude of effect of coefficients:

  • Log-level: (e^B -1)*100, gives pc change in Y from 1 unit change in X [26]
    • Except with indicator variables, when the positive change is (e^B -1)*100 but the negative change is (e^(-B) -1)*100 [27].
  • Log-log: ((1.01^B)-1)*100 [28]

Note that some guides appear to contain errors [29]!

We are working off of the besthulllayer regression coefficient between:

. xtreg growthinv17lf numdealsl numstartupsl nohull frachull tothullareal avgdisthml onegeom i.year if reg6==1, be

Between regression (regression on group means)  Number of obs     =      3,635
Group variable: placeid                         Number of groups  =        198

R-sq:                                           Obs per group:
     within  = 0.0000                                         min =          2
     between = 0.9010                                         avg =       18.4
     overall = 0.0189                                         max =         23

                                                F(29,168)         =      52.73
sd(u_i + avg(e_i.))=  .4628413                  Prob > F          =     0.0000

growthin~7lf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   numdealsl |  -.7000722   .2198263    -3.18   0.002     -1.13405   -.2660944
numstartupsl |   2.125103   .2104279    10.10   0.000     1.709679    2.540527
      nohull |  -.0154989   .0094233    -1.64   0.102    -.0341023    .0031045
    frachull |   .2890439   .2398689     1.21   0.230    -.1845017    .7625894
tothullareal |  -.0726595   .0423528    -1.72   0.088    -.1562717    .0109526
  avgdisthml |  -.2554753   .0868682    -2.94   0.004    -.4269693   -.0839814
     onegeom |  -1.217626   .4010889    -3.04   0.003    -2.009449   -.4258019
        year |
       2017  |   11.83234   3.830136     3.09   0.002     4.270947    19.39374
       _cons |  -3.687907   3.026614    -1.22   0.225    -9.663005     2.28719
1 unit change gives pc change in Y 1 pc change gives pc change in Y
numdealsl -0.69417
numstartupsl 2.137063
nohull -1.537941017
frachull 33.51503403
tothullareal -0.07227
avgdisthml -0.25388
onegeom -70.40681241


Note: Make sure that the geocoding has been fixed first! See Restoring_vcdb3#Fix_the_Geocoding. Note that we are working with decimal degrees to six decimal places from Google Maps, which is equivalent to a staggering 11cm of accuracy at the equator. See http://wiki.gis.com/wiki/index.php/Decimal_degrees.

The build is as follows:

  • portcomaster 48001 (where hadgrowth=1 34761) and portcogeo (47715 where latitude and longitude NOT NULL 47715) -> colevelsimple 33869
  • colevelsimple 33869-> CoPoints (adds point geoms)
  • tigerplaces 29322 (has place geoms) -> tiger2 (adds areas)
  • tiger2,CoPoints ST_Intersects -> colevelbase 29442
  • colevelbase,years -> colevelblowout 219060, CoLevelSummary 35344, PlacesWithGT10Active 200
  • colevelblowout,PlacesWithGT10Active -> CoLevelForCircles 171170


Take CoLevelForCircles.txt and feed it into the HCA script in


See also:

Take results.tsv and load it as the HCL table.

Layers and levels

See Agglomeration.sql for the following build:

  • hcl loaded from results.tsv 29751998
  • Determine if singleset, multiset or hullset (fully partitions data)
  • Make hclsingletons, hclmultitons and hclhulls (which contains both hulls and lines)
  • UNION them together to create hclmain (14875999), which contains geographies
  • Make hcllayer 163887: Aggregate to the layer level calculating nosingleton, nomultiton, etc., as well as tothullarea etc.

Note that everything uses Geographies (except to find centroids), and pair lengths and hull areas are scaled so that they are in hm and hectares (hm2) [30], as this is close to being a city block length and square block [31] at least in Houston, TX (note that a block in Chicago is 2 Houston blocks and there is no standard block definition).

Then load up leveldefinitions.txt. Note that we are using more levels than before, with finer grained levels at the bottom end:

Level	Label	Target
1	100msq	0.01
2	1blfront	0.1
3	1blsq	1
4	5blsq	5
5	10blsq	10
6	25blsq	25
7	50sqbl	50
8	1kmsq	100
9	5kmsq	400
10	10kmsq	1000
11	20min rule	35000

Then build the nohulls as level 0 and the allhull as level 12. Note that we are going to have to throw out panels or observations with too few points per city-year later, as these can have singletons, multitons, or pairs as their allhulls. This can be done with nohull !=0. Also nohulls can be built using hcllayer then it will contain pairs, or from colevelforcircles. I opted for the later, so that it only contains singletons and multitons.

Finally in this part, build hcllevels and hcllayerwzero. For hcllevels we are going to compute mean distances between clusters. It is computational infeasible to do this for all layers. And then for all layers (inc zero) we are going to run our selection regression.

For the next steps on the data see Jeemin Sim (Work Log). This includes details of how to load the TIF data.


We want to choose some layers to work with. https://en.wikipedia.org/wiki/Hierarchical_clustering notes that "One can always decide to stop clustering when there is a sufficiently small number of clusters (number criterion). Some linkages may also guarantee that agglomeration occurs at a greater distance between clusters than the previous agglomeration, and then one can stop clustering when the clusters are too far apart to be merged (distance criterion)."

In a similar vein, https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set describes the elbow method, using AIC/BIC, etc., as well as an Information theoretic approach, the silhoette method, etc.

For us:

  1. Discarding outliers
  2. Elbow on fraction of locations in hulls
  3. Chosen by the researcher
  4. Maximum R-Squared

We also looked at:

  • Elbow on fraction of maximum hull area in hulls

And finally, we need to think about:

  • Reasonable exclusions

Discarding Outliers

We don't need to discard outliers, per se, just find a layer where outliers are singletons. A wrong approach is to take the highest layer with a single hull (or two hulls or three hulls, etc.). It is fair that if a layer never has a hull, then presumably it only has a single location or a line of locations (note that it is possible for a line to have more than 2 locations both because of multitons and because of perfect alignment, given our Google Maps accuracy), so we can discard it. However, this approach will find when there is just one hull left, rather than the last time that there is one hull in decomposition.

Possible options and issues:

  • Find when there are first two hulls and then step back a layer -- but there might never be two hulls, so if there is only ever one hull then find the max layer.
  • Form a chain from layer 1 on down that breaks when there is no longer just one hull. Perhaps count the number links, grouping where the chain has one hull or not, or require that the chain contain level 1... [32]

We went with the first option. The base table for this approach is hcllayer. The variables are highest1hulllayer, highest2hulllayer, and highest3hulllayer in the highesthulllayer table.

It is worth noting that the highest1hulllayer occurs on average at around 35.9% unclustered (with std dev. of 48.0%), but this falls to just 12.8% when we exclude years where the highest1hulllayer is also the last and hence only layer. These percentages go down slightly for highest2hulllayer and highest3hulllayer because cities that have 2 or 3 (or more) hulls have larger ecosystems and so more layers.

Elbow on fraction of locations in hulls


The elbowcalc and elbowdata queries provide the data. elbowdata takes layer/finallayer (i.e., fraction unclustered, as the layer 1 is the all encompassing hull and final layer is the raw locations), rounds it to two digits, and then calculates the average fraction of locations in hulls and the average hull area fraction of all encompassing hull area. The former gives a nice curve with an elbow (found by taking the second derivative and setting it equal to zero) at x=0.40237.

We then identify the layer that is closest to having a fraction of locations in hulls of 0.40237, taking the lower level (i.e., the more clustered level) whenever there is a tie. The resulting indicator variable is called elbowflhlayer and is made in table Elbowflh. This is analyzed in a sheet in "Images Review.xlsx" in E:\projects\agglomeration.

Fraction of Maximum Hull Area in Hulls


We also tried computing the fraction of the maximum hull area (MHA) in covered by hulls for each layer. The maximum hull area is on layer one, when every location is in an all-encompassing hull. We excluded data from layer one as well the final layer because they lead small data issues.

A cubic was a mediocre fit to this data, giving an R2 of 83% but with lots of deviation concentrated right around the local minimum ({-0.0224722, {x -> 0.446655}} [33], point of inflection and local maximum. A quartic had an R2 of 90% at around x=0.44 (6.408 x^4 - 15.176 x^3 + 12.592 x^2 - 4.3046 x + 0.517≈0.00825284 at x≈0.440275). I tried a quintic and it had inflection points are x=0.33, 0.55, and 0.82, as well as local maxima at 0.39 and 0.90. Visually there seems to be something going on in the 20% to 40% uncovered range too, perhaps a bifurcation of results, which might be due to rounding issues.

Reasonable Exclusions

We started by including all U.S. cities that received at least $10m of growth venture capital in a year between 1980 and 2017 (inclusive). This gave us a list of 200 cities. However, we still have a lot of city-years with low number of startups.

What is a reasonable number of startups to analyze agglomeration? Three locations (which is at least three startups) is the bare minimum required for one hull without excluding outliers. And we only made images for places with 4 or more startups. A visual inspection suggests that while there is greater (relative) dispersion when counts are low, it isn't hugely problematic. It is also worth noting that excluding 4 or less would get rid of Farmer's Branch, Fort Lauderdale, and Tempe (and Bloomington, MN) in 2017, and 6 or less in 2017 would eliminate Cary and Addison, all of which are slightly problematic. Burlington, VT has 7 years in the data with more than 6 startups, and one with 6.

But everywhere (i.e., all 200 places) have 10 or more layers at some point in time. And everywhere has at least 6 years with 6 or more observations. Detroit has just 7 obs that meet this criteria, half the number of Germantown, MD and a third of Greenwood Village, CO.! Requiring a year to have six observations would reduce us to 4916 observations from 6702 (i.e., down to 73% of the data). Requiring 9 would reduce the data down to 3889 obs (58%), and we'd lose more observations as places wouldn't have enough to form a time-series. The answer then appears to be to limit to observations with 6 or more layers. We'll code the number of layers, and the max and min number of layers for a place, into the data.

Maximum R-squared


Using a maximum R-squared approach to find the 'best layer' for a city is inherently problematic. A city might have 5 layers in 1980 and 80 layers in 2017, and so using layer 40, say, irrespective of year is somewhat meaningless. There are several alternative that make more sense. One is to use the fraction unclustered, much like with the elbow approach. The other is to find the layer with a certain hull count (or as close to it as possible). Hulls might tend to be somewhat stable over time, so three hulls in Portland in 2017 will be centered in more or less the same place as three hulls in Portland in 2003. This turns out to be somewhat true, as seen in the image on the right, which uses the last time (highest level) that there are three hulls, or two for 1998 and 1993 (one of which is out of frame). One issue with this approach is that the highest level with a certain hull count is that hulls almost always contain just three points.


An obvious alternative approach is to use the first time (lowest level) that there are three hulls. There is a big difference in the layer numbers for this. See the queries below. Essentially, the HCA algorithm often takes an original hull apart and then takes the resulting hulls apart, giving a quadratic for hull count again layer. But, as is apparent in the image, this leads to big areas that would only be good for overlap analysis and not for identifying individual clusters.

--Lowest,highest, and lowest-highest, and first-after-peak level where there are three hulls
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2015' ORDER BY year,layer;
--3, 41, 37, 33 (peak at 22,23,24,25)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2010' ORDER BY year,layer;
--3, 24, 24, 24 (peak at 18)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2005' ORDER BY year,layer;
--7, 23, 21, 21(peak at 18)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2000' ORDER BY year,layer;
--3, 17, 17 , 17 (peak at 12,13,14)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1995' ORDER BY year,layer;
--1(1), 8(1), 8(1), 8(1) (no peak-just flat but still take the highest layer with the nearest lower number of hulls)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1990' ORDER BY year,layer;
--2(2), 5(2), 5(2), 5(2) (peak at 2,3,4,5 so 5 is first 'after' peak? but still take the highest layer with the nearest lower number of hulls)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1985' ORDER BY year,layer;
--1(1), 1(1), 1(1), 1(1) (peak at 1 as that's the only layer with a hull but still take the highest layer with the nearest lower number of hulls) 


Two further options are to find the lowest-highest and the first-after-peak. These often coincide. The lowest-highest finds the highest layer with x hulls and then works back down the layers taking the lowest in the continuous chain of x hulled layers. The first-after-peak finds the first layer with x hulls in or after the layers where there is the peak number of hulls. This last approach is a little problematic because sometimes there isn't a peak - its just flat - and it is inconceivable that there could be two or more peaks.

Computing the Lowest-Highest, running a PCA (see below) and using three factors in a regression to store R2, N, and adjusted R2, gives the results in the R2 On Hulls sheet of Images Review.xlsx. The first 20 placeIDs were checked (i.e., 10% of the sample). In most cases (16/20), the maximum R2 and R2-Adjusted coincided when considering only cases when N>=10. Note that R2=1 and R2 adjusted is missing for all cases where N <=4, and there there is very high volatility in both measures for N<10.

There were some draws out of scope (i.e., with N=9), and we should take the lowest(?) layer in the event of a draw (they will have the same N with prob ~1). There are some cases to look at:

  • 10 - Austin (3885 layers): Came up 3 when it could have been anything up to 23 (with N>=10). Looking at a map, this might be ok. Austin's has some fairly homogenous big clusters in 2017.
  • 9 - Atlanta (1414 layers) came up 8/8(N>=10) on R2 and 1 on adjusted R2, which was the biggest spread. All other discrepancies were of a single layer. It is hard to see either answer in the 2017 image, which looks like it has 2 clusters, or maybe 3 or 4. The r2 does jump at level 8 (to 0.50 from 0.28 in the layer before, with the previous highest being 0.39 at level 1), but level 8 has N=13 and is the second highest R2adj at 0.327 as compared with 0.332 in level 1.
  • 13 - Bellevue (1181 layers) had R2 and R2adj in agreement at level 2, but there were 7 levels with N>=10 and 13 levels over all. 2 looks ok on the map in 2017 (3 or 4 would likely be better). I expect that this is a case of a place that had dramatic growth...
  • 6 - Alpharetta (580 layers) comes it at 1 for both R2 and R2adj even with N<10 but has 6 layers, 4 of which have N>=10. It looks like a pretty homogeneous place in Georgia.

We also need to check some of the really big places:

  • San Fran (160 - 12,946) is 52 on both (N=12, >=10). Picking any N>=10 up to N=15 will always find the largest hull count. At N=16 it finds the smallest large hull count (41 to 47 are all N=16).
  • Boston (22 - 3,506) is 17 with N=10 and 11 (N=17) with N>10 on both measures.
  • New York (122 - 11,466) is 27 with both measures with N>10 (N=18). It is 67 and then 66 at N=10 on both! N=12's 65 hulls is in fourth place.
  • Palo Alto (134 - 3,492) is 1 with both measures. Even allowing N<10 there isn't an issue until N=5.

The analysis was repeated but using only 1995 to 2017 inclusive (i.e., the modern era). The results were much more stable for N>=10. 20 out of 20 maximum R2 hull counts were also maximum adjusted R2 counts. N>=10 also seemed much less contentious, though N=11 (just under 50%) would have given the same result and N>=12 would have change only one result. Austin now maximized at 11 hulls - which is pretty much in the middle of its set. Boston was the same as before: 17 hulls with N=10 and 11 hulls with N>=11. Given the shape using N=11 or 12 might be preferred. New York now follows a similar pattern to Boston, and is 67 hulls with N>=10 and 27 hulls with N>=11. Again, the higher N result seems preferred. Palo Alto is now 18 hulls (out of 33 with 19 N>=10), which is a shame but hey. SF maxes out at 52 hulls, with N=12.

So, aside from the Palo Alto result, this is clearly a greatly preferred spec. I think N>=12 (just over half of the 23 years) is fair as a cut off. Also Portland, OR, maximizes at 4 hulls on both measures with N=15...

The only thing that we should change is the R2 estimation regression. Up until this point, we've been using:

pca nosinglemulti nopair nohull totsinglemulticount totpaircount tothullcount totpairlength tothullarea 
predict pc1 pc2 pc3, score
quietly capture reg growthinv17lf pc1 pc2 pc3 if placeid==`placeid' & numclusters==`clusters'  & lowesthighestflag==1 & year>=1995 & year <.

Issues and Solution

There are two issues. Why are we using a PCA? Just to get the number of regressors down? The dimensionality isn't that high. And more importantly, one or more PCA components may be picking up a scale effect. We don't want to use the scale regressors in R2 estimation, because they might drive the R2.

So the solution is to first regress to estimate the scale effect and then create residuals:

reg growthinv17lf growthinv17l numdealsl numstartupsl i.year i.placeid if lowesthighestflag==1 & year>=1995 & year <.
predict growthinv17lfres, residuals

We then can't use nohull in the regress so our variable list is as follows:

quietly capture reg growthinv17lfres nosinglemulti nopair totsinglemulticount totpaircount tothullcount totpairlength tothullarea if placeid==`placeid' & numclusters==`clusters'  & lowesthighestflag==1 & year>=1995 & year <.

Again using a cut off of 12, there's some slight divergence between the R2 and adjusted R2 maximization points, but not much (2 out of 20). The results looks pretty much like before for the first 20 with some minor differences. 12 out of 20 places have the same answer. The 7 of the 8 remaining are different by just a hull or two. Only Austin is different, with now 21 hulls rather than 11. The R2 adjusted for Austin maximizes at 7.

Checking the other cities leads to the following observations:

  • Portland is still maximized at 4 hulls
  • Boston maximizes at 16 hulls on R2 and 5 on R2 adjusted. Recall that it was at 11 with N>=11.
  • New York now maximizes at just 17 hulls, which is a massive drop. But it does look like a clean interior solution.
  • Palo Alto is down at 12 hulls.
  • SF is at 21 hulls, way down from its old value in the 50s.
  • The large set results look more interior and stable than before... the cutoff of 12 looks reasonable too.

Revisiting Portland


Portland doesn't have 4 clusters for any year before 2000, or for 2007 and 2009. For 5 year multiples the layers are as follows:

2000	15
2005	19
2010	19
2015	33

The resulting map has much more adjacency than overlap. Measuring the nearest hull edge and center distance for each hull in a year to each hull in the next year and averaging would compute two measures of hull persistence. The overlap area from year to year, either in total or as a fraction of the second year's (or smaller years) total area, would provide another measure of persistence.

What do we want to know?

So now we have 200 (ish) cities with their optimally selected hulls (we chose the best hull count that is constant from 1995 to 2017 using the lowest-highest occurrence of that count). And now we'd like to know:

  • Whether having fewer hulls is associated with growth, controlling for size -- it is: nohulll -.168335***
  • Whether having a greater hull density is associated with growth, controlling for size -- it is: tothulldensityl .0730263***
  • Whether having a higher fraction of locations inside hulls is associated with growth, controlling for size: -- it isn't: frachull -.1345406*
  • Whether having hulls closer together is associated with growth, controlling for size. We should put these layer in the list to build avghulldisthm and avgdisthm (see line 1335).

Houston, TX

We also want to know about Houston, TX.

tab place placeid if placeid >50 & placeid <100
//Houston is 83
tab chosenhullspcar2inc if placeid==83
SELECT year, layer FROM MasterLayers
WHERE place='Houston' AND statecode='TX' AND layer=lowesthighestlayer AND numclusters=10
ORDER BY year, layer;

1990	20
2000	45
2005	50
2010	30

print myDict["Houston_TX"];
[-95.836717, -95.014608, 29.515925, 30.155908]

Useful place data:

  • The Sears Building is at 29.8055662,-95.7145812 [34]
  • The four corners of the innovation corridor are [35]:
    • 100 Hogan St Houston, TX 77009 29.774004, -95.367127
    • Second Ward Houston, TX 29.759421, -95.346044
    • Orange Lot Houston, TX 77054 29.682241,
    • Buffalo Speedway Houston, TX 77025 29.695301, -95.426753

The policy evaluation methodology:

  • Find the optimal number of hulls. Choose the corresponding layer for 2017, or the closest layer to it. Call this the Houston2017 layer.
  • Run a regression across cities to estimate the effect of the number of hulls, total hull area, avg hull distance, fraction in hulls, and perhaps other measures, as well as the scale measures, to generate coefficients.
  • Plug in Houston2017's values.
  • Create an artificial Houston2017X layer that moves 1/x (x=4, for example) of all of Houston's startups into a single 25hmsq innovation district. Evaluate it!
  • Create an artificial Houston2017Y layer that moves 1/x of all of Houston's startups into the proposed innovation corridor. Evaluate it!
  • Also calculate the expected values for a city with Houston's characteristics (?) and plug those in.

Working off this regression:

xtreg growthinv17lf nohull nohullsq frachull frachullsq tothullarea tothullareasq avghulldisthm avghulldisthmsq, be
  growthinv17lf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         nohull |   .3839447   .0563233     6.82   0.000     .2727974     .495092
       nohullsq |  -.0098748   .0024791    -3.98   0.000     -.014767   -.0049826
       frachull |  -1.695035   2.044876    -0.83   0.408    -5.730354    2.340284
     frachullsq |  -.6243502   1.534811    -0.41   0.685    -3.653117    2.404416
    tothullarea |  -.0007168   .0003133    -2.29   0.023     -.001335   -.0000985
  tothullareasq |   1.44e-07   5.52e-08     2.60   0.010     3.48e-08    2.53e-07
  avghulldisthm |   .0157895   .0068875     2.29   0.023     .0021978    .0293813
avghulldisthmsq |  -.0001506   .0000556    -2.71   0.007    -.0002602   -.0000409
          _cons |   3.629909   .5882279     6.17   0.000     2.469112    4.790707

In 2017, Houston had the following characteristics:

year	layer	growthinv17	growthinv17l	nohull	nohullsq	frachull	frachullsq	tothullarea	tothullareasq	avghulldisthm	avghulldisthmsq			
2017	20	76.533	4.350703652	7	49	0.595744681	0.354911725	268.2674944	71967.44853	100.6191636	10124.21608	1		
				0.3839447	-0.0098748	-1.695035	-0.6243502	-0.0007168	1.44E-07	0.0157895	-0.0001506	3.629909		
				2.6876129	-0.4838652	-1.009808085	-0.221589206	-0.19229414	0.010363313	1.588726284	-1.524706942	3.629909	4.484347923	88.61914544

We are going to do the following:

  1. Calculate a base effect using the real data, much as above.
  2. Define some target areas - the innovation corridor and a new innovation district. The innovation districts location will have to be picked based on the data and its size determined by the optimal hull area (or using the 25hmsq suggested by the economists).
  3. Take the 2017 Houston data (providing we don't need f.growthinv or can come up with it seperately), reallocate a randomly selected 25% of startups to a random location in the target area and recompute the hulls and layers using the HCA script.
  4. Compute the effect on growth and do a back-of-the-envelope calculation for whether it is worth TIF financing.

Note that 0.001 decimal degrees is 111m or 1.11hm [36], so 2.5 hm (for a total of 5 per side, and so 25hmsq) is 0.002252 decimal degrees and 1.617hm (giving a hull area of 10hmsq, which at 0.95 startups per hectare -- essentially 1 per square block, accommodates 25% of Houston's 47 active startups in 2017 in 10.46025 hectares). Using centroid analysis on Houston's 2017 hulls, the optimal location for a 3.2 block by 3.2 block innovation district is just above the Galleria area in Uptown [37]. There is a business park directly adjacent to these coordinates which ia about the right size for the innovation district and currently includes the offices of Schlumberger, Alert Logic, Hire Priority, and others.

Supposing that all of Houston's 43 active startups were relocated, it doesn't much matter where you put them. One questions is whether the 4 sq mile innovation corridor would then be an improvement over the status quo, and how much worse it would be than a district that of the implied optimum density? Such a district, using 0.95 hectares per location, would have an implied hull area of around 45 hectares, or a 0.003011 decimal degrees deviation in four directions from a point (to give 4 corners of a square).

Group Means Regression

Once we have found optimum hull specifications within a city, they will not vary, or will vary very little, over time. We therefore want to use a between panel regression, also called a group means regression. See the following:

Image Analysis

Building Images

Use B&W:

  • 50% grey at 75% transparent for city outline
  • 50% grey at 50% transparent for hull
  • Black + for singleton (size 10pt), 50% transparent when in pair or hull
  • Black * for multiton (size 10pt), 50% transparent when in pair or hull
  • Black line for pair

Informs colors:

  • Orange: R240 G118 B34
  • Blue: R31 G61 B124
  • Green: R129 G190 B65

Town: Blue, 75% tranparent

Working with ArcPy

First version saved as E:\projects\agglomeration\Test.mxd

If the basemaps aren't available, connect to ERSI online using the icon in the system tray[38]

Basic set up is:

  • displayhulls layer=1 grey50%,trans50%,noborder
  • displaymultitons asterisk4,black,size10,trans50%
  • displaysingletons cross1,black,size10,trans50%
  • placetigerarea grey50%,trans75%,grey80%border
  • Reference
  • Basemap - World Light Grey Canvas

Change dataframe map to GCS 1984 and display to decimal degrees Saved as: FullHullReview2017.mxd

Open python window then:

#Load the map and create the dataframe
mxd = arcpy.mapping.MapDocument(r"E:\projects\agglomeration\FullHullReview2017Colored.mxd")
df = arcpy.mapping.ListDataFrames(mxd)[0]
df.credits="(c) Ed Egan, 2019"
#Now do the image generation!
myDict = {} 
#myDict["Burlington_VT"] = [-73.27691399999999,-73.176383,44.445927999999995,44.539927] 
#... See the entries in E:\projects\agglomeration\arcpydict.txt

for location in myDict:
  newExtent = df.extent
  newExtent.XMin = myDict[location][0]
  newExtent.XMax = myDict[location][1]
  newExtent.YMin = myDict[location][2]
  newExtent.YMax = myDict[location][3]
  df.extent = newExtent
  filename="E:\\projects\\agglomeration\\Images\\"+location +".png"
  arcpy.mapping.ExportToPNG(mxd, filename, resolution=144)


If you run into issues, it's useful to test things step by step:

#Test some exports, note that if the geocords are in a different system from the extent parameters, you'll be exporting blank images!
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto1.png", resolution=144)
print df.extent
#-73.5977988105381 44.1185146554607 -72.8022011894619 44.680787974202 NaN NaN NaN NaN

#Test with Burlington, VT
newExtent = df.extent
newExtent.XMin =-73.219086
newExtent.XMax =-73.19356
newExtent.YMin = 44.460346
newExtent.YMax = 44.48325
df.extent = newExtent
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto2.png", resolution=144)
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto3.png", resolution=144)

#Test with Buffalo, NY (while looking at Burlington, VT)
newExtent = df.extent
newExtent.XMin =-78.95687699999999
newExtent.XMax =-78.795157
newExtent.YMin =42.826023
newExtent.YMax =42.966454999999996
df.extent = newExtent
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BuffaloAuto1.png", resolution=144)
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BuffaloAuto2.png", resolution=144)

Remove basemap credits[39]:

  • Click on World map layer
  • Insert->Dynamic Text->Service Level Credits
  • Set the symbol color to no color

Help pages:

Analyzing the results

The following issues became apparent (Counts out of 191 cities with 4 or more locations in 2017 and greater than $10m inv in a year over all time):

  1. Encapsulation - A small number of place boundaries are fully encapsulated inside of other geoplaces. We need to determine when this happens. The initial list includes Addison, Culver City, Santa Monica (might be extreme adjacency), and others. We need a query to work this out.
  2. Concavity (6 marked) - Some place boundaries are fairly extremely concave (for instance, Fort Lauderdale, FL, Birmingham, AL, Boulder, CO). This in itself isn't too much of an (addressable) issue. However, a small number of places have concavity and adjacency issues, which together lead to hull overlaps. This is ameriorated by removing outliers, but we should check them (e.g., Cary, NC, Morrisville, NC, the city next to Newark, CA, Roswell, GA)
  3. Adjacency (23 marked) - The entire of the valley has an adjacency issue (these weren't marked), as do a fairly large number of other cities. See Newport Beach, CA and others. Lexington, MA provides a nice example of containment despite adjacency. As does Cambridge, MA with the right outliers removed.
  4. Outliers (52 marked) - perhaps as many as 1 in 5 cities had one or two obvious outliers on a visual inspection.

Critical checks:

  • Addison, TX: encapsulation
  • Culver City, CA: encapsulation
  • Oklahoma City, OK: scale issue (one outlier in State House?)
  • Portland, ME: scale issue. Though Portland's place boundary contains an island and some sea area, making it very wonky, this isn't an issue.
  • San Juan Capistrano, CA: Just 2 locations (1 singleton and 1 multiton) and no hull. Note that we might want to omit this place.

We might also want to check Twin Cities. Here's the results:

place	statecode	Issue	Reason
Champaign	IL	No	Urbana isn't in the data
Phoenix	AZ	No	Mesa isn't in the data
San Francisco	CA	No	Twinned with Oakland!
Oakland	CA	No	Twinned with SF!
Stamford	CT	Yes	Norwalk 
Norwalk	CT	Yes	Stamford
New Haven	CT	No	Bridgeport isn't in the data
Tampa	FL	No	St. Petersburg
Portland	ME	No	South Portland isn't in the data
Minneapolis	MN	Yes	St. Paul
Bloomington	MN	No	Normal isn't in the data
Durham	NC	Yes?	Raleigh
Raleigh	NC	Yes?	Durham
Portland	OR	No	Vancouver isn't in the data
Bethlehem	PA	No	Allentown isn't in the data
Dallas	TX	Yes?	Fort Worth
Fort Worth	TX	Yes?	Dallas
Seattle	WA	No	Tacoma isn't in the data

A visual inspection suggests that Stamford and Norwalk might be better combined but don't really matter. Minneapolis and St. Paul are pretty separate and really separate after removing outliers. Rarleigh and Durham are completely separate (Cary is more of an issue), as are Dallas and Fort Worth and SF and Oakland.


The data suggests that there are 12 places that encapsulated by 7 other places:

SELECT A.place, A.statecode, B.place AS ContainedPlace, B.statecode AS ContainedStatecode
       FROM placetigerarea AS A
        JOIN placetigerarea AS B ON st_contains(ST_ConvexHull(A.placegeog::geometry),ST_ConvexHull(B.placegeog::geometry))
        WHERE NOT (A.place=B.place AND A.statecode=B.statecode);
place	statecode	containedplace	containedstatecode
Los Angeles	CA	Culver City	CA
Los Angeles	CA	Torrance	CA
Los Angeles	CA	El Segundo	CA
Los Angeles	CA	Santa Monica	CA
San Jose	CA	Santa Clara	CA
Fremont	CA	Newark	CA
Oakland	CA	Emeryville	CA
Cary	NC	Morrisville	NC
New York	NY	Jersey City	NJ
Dallas	TX	Richardson	TX
Dallas	TX	Addison	TX
Dallas	TX	Farmers Branch	TX

We could ignore, flag or discard these cites. A visual inspection suggests that Culver City, Torrence, El Segundo, Jersey City, and probably Richardson, Newark, and maybe Cary don't have any issues. Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison do look like they have issues, but with the exception of Farmer's Branch and Addison, these are big cites and with lots of locations, so the issue should be washed out by removing outliers or otherwise appropriately choosing the clustering layer.

After reflection, we decided to deal merge the following places (listing geoids):

Santa Monica 0670000 -> LA 0644000
Santa Clara 0669084 -> San Jose 0668000
Emeryville 0622594 -> Oakland 0653000
Farmer's Branch 4825452 -> Dallas 4819000
Addison 4801240 -> Dallas 4819000
Newark 0650916 -> Freemont 0626000
Morrisville 3744520 -> Cary 3710740

Intersecting All Encompassing Hulls

52 places have all encompasing hulls intersect in our data (i.e., there are 26 intersections). This includes some of the places that suffer from encapsulation (especially Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison). So beyond encapsulated places, there are an additional 20 intersections. These are:

place	statecode	intersectedplace	intersectedstatecode
Alpharetta	GA	Roswell	GA
Bellevue	WA	Redmond	WA
Boston	MA	Cambridge	MA
Boston	MA	Somerville	MA
Campbell	CA	San Jose	CA
Centennial	CO	Greenwood Village	CO
Cupertino	CA	San Jose	CA
Fremont	CA	Newark	CA
Greenwood Village	CO	Centennial	CO
Irvine	CA	Newport Beach	CA
Los Altos	CA	Mountain View	CA
Menlo Park	CA	Redwood City	CA
Milpitas	CA	San Jose	CA
Mountain View	CA	Palo Alto	CA
Mountain View	CA	Sunnyvale	CA
Newton	MA	Wellesley	MA
Phoenix	AZ	Tempe	AZ
Redwood City	CA	San Carlos	CA
San Jose	CA	Sunnyvale	CA
Santa Clara	CA	Sunnyvale	CA

At a glance, most of these appear big or very big startup ecosystems. Accordingly, any process that deals with outliers (etc.) should address this issue.

First Estimation(s)

Note that this subsection is now very out of date!

At this stage we have MasterLevels.txt and MasterLayers.txt as datafiles. MasterLevels.txt contains only layers corresponding to levels 0 through 12 and also has noothergeoms and avgdisthm as variables.

The questions we need to answer are: 1) Is there an agglomeration effect? 2) Which level or layer best describes a city (perhaps for a year, or perhaps over its life)?

We can just pick a level (say 25 hectares) and run a within-city regression:

. xtreg growthinv17l_f growthinv17l nosingletonl totmultitoncountl totpaircountl tothullcountl avgpairlengthl avghulldensi
> tyl avgdisthml i.year if level==6, fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,027
Group variable: placelevelid                    Number of groups  =        198

R-sq:                                           Obs per group:
     within  = 0.4097                                         min =          3
     between = 0.8310                                         avg =       25.4
     overall = 0.5974                                         max =         37

                                                F(44,197)         =      78.20
corr(u_i, Xb)  = 0.4087                         Prob > F          =     0.0000

                              (Std. Err. adjusted for 198 clusters in placelevelid)
                  |               Robust
   growthinv17l_f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     growthinv17l |   .1388644   .0176074     7.89   0.000     .1041412    .1735877
     nosingletonl |   .1447935   .0402488     3.60   0.000     .0654197    .2241673
totmultitoncountl |   .0909545   .0481349     1.89   0.060    -.0039714    .1858803
    totpaircountl |   .1724367   .0383185     4.50   0.000     .0968695    .2480039
    tothullcountl |   .7120504   .0467915    15.22   0.000     .6197739    .8043269
   avgpairlengthl |  -.0219417    .023633    -0.93   0.354    -.0685478    .0246645
  avghulldensityl |    .049566   .0202756     2.44   0.015     .0095808    .0895511
       avgdisthml |   .0933327    .076309     1.22   0.223    -.0571546    .2438201


. xtreg growthinv17l_f growthinv17l numstartups numstartupssq nosinglemulti nosinglemultisq nohull nohullsq nopair nopairs
> q i.year if level==6, fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,773
Group variable: placelevelid                    Number of groups  =        200

R-sq:                                           Obs per group:
     within  = 0.4017                                         min =          4
     between = 0.8425                                         avg =       28.9
     overall = 0.5708                                         max =         37

                                                F(45,199)         =      72.39
corr(u_i, Xb)  = 0.4222                         Prob > F          =     0.0000

                            (Std. Err. adjusted for 200 clusters in placelevelid)
                |               Robust
 growthinv17l_f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   growthinv17l |    .220333    .018699    11.78   0.000     .1834595    .2572066
    numstartups |   .0062875   .0022944     2.74   0.007      .001763    .0108119
  numstartupssq |  -8.04e-07   1.14e-06    -0.70   0.483    -3.06e-06    1.45e-06
  nosinglemulti |   .0648575   .0168134     3.86   0.000     .0317023    .0980127
nosinglemultisq |  -.0021614   .0006336    -3.41   0.001    -.0034108    -.000912
         nohull |   .1747691   .0255105     6.85   0.000     .1244636    .2250747
       nohullsq |  -.0057148   .0012164    -4.70   0.000    -.0081136    -.003316
         nopair |   .0896908   .0248207     3.61   0.000     .0407455    .1386361
       nopairsq |  -.0097196   .0024153    -4.02   0.000    -.0144825   -.0049567

Note that the following don't work, either alone or with other variables (including numstartups and numstartupsq), probably because they are third-order effects:

. xtreg growthinv17l_f growthinv17l avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.
> year if level==6, fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,027
Group variable: placelevelid                    Number of groups  =        198

R-sq:                                           Obs per group:
     within  = 0.3579                                         min =          3
     between = 0.5926                                         avg =       25.4
     overall = 0.3753                                         max =         37

                                                F(43,197)         =    2152.49
corr(u_i, Xb)  = 0.2529                         Prob > F          =     0.0000

                             (Std. Err. adjusted for 198 clusters in placelevelid)
                 |               Robust
  growthinv17l_f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    growthinv17l |   .2668427   .0208574    12.79   0.000     .2257101    .3079752
  avghulldensity |   .0008076   .0003875     2.08   0.038     .0000433    .0015718
avghulldensitysq |  -1.14e-07   8.80e-08    -1.29   0.197    -2.87e-07    5.97e-08
   avgpairlength |  -.0018724   .0036128    -0.52   0.605    -.0089972    .0052524
 avgpairlengthsq |  -.0000296   .0000596    -0.50   0.620    -.0001471    .0000879
       avgdisthm |    .001429   .0035371     0.40   0.687    -.0055465    .0084045
     avgdisthmsq |   -.000012   .0000157    -0.76   0.447    -.0000429     .000019

We can also do it with fractions and their squares (omit fracsinglemulti). However at level 6 (25 hectare), pairs seems more important than hulls:

. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year if level==6, 
> fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,773
Group variable: placelevelid                    Number of groups  =        200

R-sq:                                           Obs per group:
     within  = 0.3919                                         min =          4
     between = 0.8456                                         avg =       28.9
     overall = 0.5274                                         max =         37

                                                F(43,199)         =      62.34
corr(u_i, Xb)  = 0.4268                         Prob > F          =     0.0000

                          (Std. Err. adjusted for 200 clusters in placelevelid)
              |               Robust
growthinv17~f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 growthinv17l |   .2481436   .0181447    13.68   0.000     .2123631     .283924
  numstartups |   .0100673   .0019792     5.09   0.000     .0061644    .0139702
numstartupssq |  -6.92e-06   2.03e-06    -3.41   0.001    -.0000109   -2.92e-06
     fracpair |   .7540177   .3709212     2.03   0.043     .0225772    1.485458
   fracpairsq |     -1.936   .7030942    -2.75   0.006    -3.322472   -.5495289
     frachull |   .1969853    .562807     0.35   0.727    -.9128457    1.306816
   frachullsq |  -.1491389   .3878513    -0.38   0.701    -.9139649     .615687

Whereas across all levels:

. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year, fe cluster(p
> lacelevelid)

Fixed-effects (within) regression               Number of obs     =     76,623
Group variable: placelevelid                    Number of groups  =      2,600

R-sq:                                           Obs per group:
     within  = 0.3956                                         min =          4
     between = 0.8330                                         avg =       29.5
     overall = 0.5279                                         max =         37

                                                F(43,2599)        =     827.33
corr(u_i, Xb)  = 0.4143                         Prob > F          =     0.0000

                        (Std. Err. adjusted for 2,600 clusters in placelevelid)
              |               Robust
growthinv17~f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 growthinv17l |   .2522524   .0049725    50.73   0.000     .2425019    .2620028
  numstartups |   .0100677   .0005323    18.91   0.000     .0090239    .0111114
numstartupssq |  -6.95e-06   5.51e-07   -12.62   0.000    -8.03e-06   -5.87e-06
     fracpair |   .4152028   .0956859     4.34   0.000     .2275745    .6028311
   fracpairsq |  -.9753654   .1271631    -7.67   0.000    -1.224717   -.7260141
     frachull |  -.8606939   .1231519    -6.99   0.000     -1.10218   -.6192081
   frachullsq |    .495785   .0976557     5.08   0.000     .3042942    .6872758

This is probably because of the variation in hulls vs pairs at level 6, which has lots of cities with nothing in pairs and everything in hulls. We might want to 'control' for cityarea by restricting our within city analysis to large enough cities. A 25 hectare target area might be too encapsulating -- more than 10% of observations are 100% in hulls:

. su frachull if level==6, det

      Percentiles      Smallest
 1%     .2162162       .1153846
 5%     .3333333       .1153846
10%     .4285714       .1428571       Obs               6,032
25%           .6       .1428571       Sum of Wgt.       6,032

50%           .8                      Mean           .7539126
                        Largest       Std. Dev.      .2209328
75%     .9666666              1
90%            1              1       Variance       .0488113
95%            1              1       Skewness      -.6390206
99%            1              1       Kurtosis       2.400947

I tried using R2 to select levels, but only the second spec had an interior solution (at level 3):

forvalues i=1/12 {
	quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.year if level==`i', fe cluster(placelevelid)
	display "Reg: 1   level: " `i' "   r2-within: " `e(r2_w)'
forvalues i=1/12 {
	quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq i.year if level==`i', fe cluster(placelevelid)
	display "Reg: 2   level: " `i' "   r2-within: " `e(r2_w)'
forvalues i=1/12 {
	quietly capture xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracsinglemulti fracsinglemultisq fracpair fracpairsq frachull frachullsq i.year if level==`i', fe cluster(placelevelid)
	display "Reg: 3   level: " `i' "   r2-within: " `e(r2_w)'
Reg: 2   level: 1   r2-within: .39552569
Reg: 2   level: 2   r2-within: .3998779
Reg: 2   level: 3   r2-within: .40348691
Reg: 2   level: 4   r2-within: .40130097
Reg: 2   level: 5   r2-within: .39931203
Reg: 2   level: 6   r2-within: .39707046
Reg: 2   level: 7   r2-within: .39366909
Reg: 2   level: 8   r2-within: .38957831
Reg: 2   level: 9   r2-within: .38398108
Reg: 2   level: 10   r2-within: .37662604
Reg: 2   level: 11   r2-within: .36999057
Reg: 2   level: 12   r2-within: .38393843

However, doing this by exgroup (0 t0 3), gives the same result - level 3 - for each exgroup.

Alternative approaches are to use AIC/BIC, or maybe entropy. For the same set of variables, in the same model, AIC/BIC are minimized when R2 is maximized, they are only useful when choosing the combination of the variables/estimation and the level. And it seems we can only do entropy one variable at a time:

entropyetc nohull if level==1

TIF data

See the TIF Project page for details on the TIF data. The section that was originally on this page was moved there.

TIF Analysis

Using Burlington, VT at the elbow, we plotted the hulls and TIFs. This is in WorkingMapV2.mxd. Some potential measures include:

  • Overlapping Hull and TIF area
  • Fraction of Hull area covered by TIFs
  • Non-overlapping Hull and TIF area
  • Adjacent Hull and TIF area (and non-adjacent) -- expand out hull area to allow for new inclusion without affecting density
  • Count or Fraction of locations in TIF areas

The data needs to be reprocessed to be in the format:

place   statecode   year   tifname   geog

Or at least start-year and end-year...

Plotting Chicago (-87.7, 41.9) in 2017 using chosenlayer (layer=92) reveals some insights. There aren't any TIFs in the city core, but there are lots of startups. And conversely likewise, there are lots of TIFs covering suburbs that have few if any startups. There are, however, some areas of overlap. And some possible pattern of startups appearing exactly where TIFs aren't -- In a few cases (one notable), the startups are essentially surrounded.

And Houston in 2017, using layer=20 (as no chosen exists and 20 has max nohull at 7), it seems that there is little overlap between hulls and TIRZ but considerable overlap between Rice's "Innovation Corridor". The Sears Building, at (29.7255505,-95.3887225) [40] and Houston Exponential's building at (29.7521572,-95.3778919) [41] are both in the Midtown TIF, albeit at opposite ends! There are other TIFs impacted by Rice's proposal too. The East Downtown, Market Square, Montrose, and at least three others all intersect the "planned" innovation corridor.

Note that The Sears Building Area wasn't in the Midtown TIF when they did the last bond issue in 2015 (see map on A-1 [42]). They raised $13.5m in 2015 to pay off existing debt, and then raised $39.31m in 2017 [43] to conduct the Plan, pay off debt, etc. The area was in the map in the 2017 issue. The area around the Sears building "contains virtually no taxable property and therefore will produce no significant Captured Appraised Value".

Density Maps

It's useful to lay each year's chosen hulls on top of each other over time (from say 1995 to present, or using the layer with the max number of hulls if the year never achieves the chosen hull count). However, to do this we should expand the hulls, because all hulls have points at their corners and most hulls have points only at their corners. I propose using ST_Expand to increase X and Y distances separately. The method would be to take ((Centroid's X-ST_XMin)+(ST_Xmax-Centroid's X)/2) as the X expansion distance and likewise for Y.

Correction - we definately don't want ST_Expand as it creates a bounding box. We want ST_Buffer, but how big a buffer? Half of the maximum width is easy.

   (SELECT geom FROM mylayer WHERE gid=1),
   (SELECT geom FROM mylayer WHERE gid=1))

Also, transparency doesn't stack within a layer... See https://gis.stackexchange.com/questions/91537/how-to-vary-the-transparency-of-symbols-within-a-single-layer-in-arcmap Rather than half of the maximum width, we could use the average distance from a corner to another corner (i.e., pretty much between locations) divided by two. Use ST_DumpPoints(geometry geom) to recover the corners of the convex hulls. We should also put some other places on the map (long,lat):

  • Mercury Fund [44] at (-95.4306675,29.7331018)
  • The Galleria [45] at (-95.4627705,29.7411474).
  • The center of Downtown [46] at (-95.3755815,29.7575263)
  • Rice University [47] at (-95.4040199,29.7173941)
  • The Energy Corridor [48] at (-95.7131468,29.8698068)
  • Houston Community College - Spring Branch Campus [49] at (-95.5631846,29.7841191)
  • Westchase Neighborhood [50] at (-95.5832518,29.72609)
  • Houston Community College Alief Hayes Campus [51] at (-95.5770254,29.7336065)

Visually, it is easy to layer the years, using opacity to build up an effect over time. In the data, it is more difficult. Each year could have thickness one and then we could count the number using ST_intersects while creating the new hulls using ST_Intersection (returns null when no interection). If there are more than one intersections with the highest intersects count, then we could take the largest one as the ultimate one. The centroid of the ultimate intersection would be the heart of a city's startup scene.

I also added the roads from the Tiger Line Shapefile for Harris County[52]:

shp2pgsql -c -D -s 4269 -W "latin1" -I tl_2013_48201_roads.shp tlharris | psql -U researcher -d vcdb3

As well as the US national file for the coastlines:

shp2pgsql -c -D -s 4269 -W "latin1" -I tl_2017_us_coastline.shp tlcoastline | psql -U researcher -d vcdb3

Unfortunately, this doesn't show lakes... You can get all lines from https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2019&layergroup=All+Lines

But you have to do them county by county.

shp2pgsql -c -D -s 4269 -W "latin1" -I tl_2019_50007_edges.shp tlchittenden | psql -U researcher -d vcdb3

And there's so many features...

One problem with this method is that there are partial intersections. We could use ST_Difference to return the part that doesn't intersect... a bigger problem is that we are restricted to pairs of geometries. Using a cross product we could test all rows against all other rows. But then we'd need to aggregate the intersections... One method is to use a recursive CTE [53]. ST_Union is truly an aggregate function but not what we want in this context.

Another thing that might be an issue is that when hulls are expanded, they may intersect within a year too. Counting across year and within year intersections the same would simplify this, but it might be important to track them separately?


See also:

Old Work Using Circles

Very Old Summary

Agglomeration is generally thought to be one of the most important determinants of growth for urban entrepreneurship ecosystems. However, there is essentially no empirical evidence to support this. This paper takes advantage of geocoding and introduces a novel measure of agglomeration. This measure is the smallest circle area that covers all startup offices, subject to having at least N startups in each circle. Using GIS data on cities, this paper controls for the density and socio-demographics of an area to identify the effect of just agglomeration.


Clusters of economic activity plays a significant role in the firms performance and growth. An important driver of growth is the knowledge spillover between firms. This includes among others the facilitation of information flow and ideas between firms which could be a milestone especially in the growth of startup firms or small businesses. This project focuses on the effects of agglomeration on the performance and growth of startup firms. It introduces a novel measure of agglomeration which can be used to empirically test the effects of clustering. This measure the is smallest total circle area that covers all of the startups in the sample such that there are at least n firms in each circle. The projects is based on the creation of an algorithm which gives an unbiased measure to be used in the empirical analysis. The regression we are interested in takes the following form:

Regression equation.png

The dependent variable is a measure of growth of the firms. This measure could be investment forwarded one period or growth in investment. The control variables include the number of the startups firms, m, the agglomeration measure, A and a vector of other control variables affecting the growth of firms at time t. Because of the endogeneity in the circle area or the measure of agglomeration, A, there is a need for an instrumental variable to get consistent estimates of the effects we are interested in. The proposed instrument is the presence of a river, or road in between the points representing geographical locations of the venture capital backed up firms. The instrument affects agglomeration without having a direct impact on the growth. This makes it good candidate for a valid instrument. The next tasks are determining the additional control variables to include in the regression, years to include in the analysis and methods of finding an unbiased measure of agglomeration.


Making the circle input data

Ed's additional datawork is in


The key table for circle processing is CoLevelBlowout, which is restricted (to include cities with greater than 10 active at some point in the data) to make CoLevelForCircles.

We need to:

  1. Winsorize CoLevelBlowout
  2. Compute the circles!
  3. Make the Bay Area (over time) data
  4. Plot the Bay Area data (with colors per Bay Area city) for 1985 to present
  5. Combine the plots to make an animated gif

To winsorize the data we need the formula for Great Circle Distance. The radius of the earth is 6,378km (half of diameter: 12,756 km). So:

GCD = acos( sin(lat1) x sin(lat2) + cos(lat1) x cos(lat2) x cos(long1-long2) ) x r

Main Sources

The primary sources of data for this project are:

  • SDC VentureXpert - from VC Database Rebuild, the key table is
  • GIS City Data
  • Data on NSF, NIH, population, income, clinical trials, employment, schooling, R&D expenditures and revenue of firms can be found in Hubs.

VC data

Data on the number of new vc backed firms in each city and year is in:

Z:\Hubs\2017\clean data
The name of the file is firm_nr.txt.

Database is cities SQL script is: nr_firms.sql

Raw data is in:

The file is colevelsimple.txt

In order to see if there are outliers, I get the average coordinates for all cities and find the differences of the firm's coordinates from the city coordinate. The script for the average city coordinates is in

Z:\Hubs\2017\sql scripts and the file name is newcolevel.sql.

The differences are taken in excel. The file containing the differences is in

Z:\Hubs\2017 and the file name is new_colevel.txt.
  • Data on the circle area in each city and year is in:
Z:\Hubs\2017\clean data
The name of the file is circles.txt. (It contains only 106 observations)

Database is cities SQL script is: circles.sql

The script for joining the two tables on the VC table is in:

Z:\Hubs\2017\sql scripts
 The name of the file is new_firm_nr_circles.sql
  • We use the cities with greater than 10 active VC backed firms. Data on the cities and number of active firms is in:
E:\McNair\Projects\Hubs\Summer 2017
The file is CitiesWithGT10Active.txt

The script for joining the final data with this file is located in

Z:\Hubs\2017\sql scripts
The file name is final_joined_kerda.sql.

The final data is in

Z:\Hubs\2017\clean data
The file name is new_final_kerda.txt.

Accelerator data

Accelerators data is in

Z:\Hubs\2017\clean data
The file name is accelerators.txt
The table is accelerators

The joined accelerators data with the VC table is in joined_accelerators table. The script is in

Z:\Hubs\2017\sql scripts
The file name is join_accelerators.sql

The do file is in

The name is agglomeartion_kerda.do

It includes the graphs, tables and the preliminary FE regressions with VC funding amount and growth rate. It also predicts the hazard rates, matches on the hazard rate in order to create synthetic control and treatment groups. What is left to do is to add 2 lagged and 3 forward observations for the cities which do have a match. Remove the overlapping observations for the years that get a treatment but which at the same time serve as a control.

See also


Entrepreneurship, small businesses and economic growth in cities:

Specifities/ Outliers to consider

New York (decompose)
Princeton area (keep Princeton  unique)
Reston, Virginia (keep)
San Diego (include La Jolla)
Silicon Valley (all distinct)

Unbiased measure

The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, which is not affected by the number of the startups, given the size and their distribution.

For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf

Census Data


The Census Gazetteer files for 2010, 2000 and 1990 can give use population by census place. See https://www.census.gov/geo/maps-data/data/gazetteer.html

The places file contains data for all incorporated places and census designated places (CDPs) in the 50 states, the District of Columbia and Puerto Rico as of the January 1, 2010. The file is tab-delimited text, one line per record. Some records contain special characters.
Download the National Places Gazetteer Files (1.2MB)
Download the State-Based Places Gazetteer Files:
Column	Label	Description
Column 1	USPS	United States Postal Service State Abbreviation
Column 2	GEOID	Geographic Identifier - fully concatenated geographic code (State FIPS and Place FIPS)
Column 3	ANSICODE	American National Standards Insititute code
Column 4	NAME	Name
Column 5	LSAD	Legal/Statistical area descriptor.
Column 6	FUNCSTAT	Functional status of entity.
Column 7	POP10	2010 Census population count.
Column 8	HU10	2010 Census housing unit count.
Column 9	ALAND	Land Area (square meters) - Created for statistical purposes only.
Column 10	AWATER	Water Area (square meters) - Created for statistical purposes only.
Column 11	ALAND_SQMI	Land Area (square miles) - Created for statistical purposes only.
Column 12	AWATER_SQMI	Water Area (square miles) - Created for statistical purposes only.
Column 13	INTPTLAT	Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
Column 14	INTPTLONG	Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively.


See https://www.census.gov/geo/maps-data/data/relationship.html

These text files describe geographic relationships. There are two types of relationship files; those that show the relationship between the same type of geography over time (comparability) and those that show the relationship between two types of geography for the same time period.

ACS (American Community Survey) Data

Steps to download:

1) Go to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml
2) Select 'I know the dataset or table(s) that I want to download.'
3) Press Next
4) For 'Select a program:' choose
       'American Community Survey'
5) For 'Select a dataset and click Add to Your Selections:' choose
       '<YEAR OF INTEREST> ACS 1-year estimates'
6) Press 'Add To Your Selections'
7) Press Next
8) For 'Select a geographic type:' choose
       'Place - 160'
9) For Select a state:
       Don't choose a state, as we wish to download all.
10) For 'Select one or more geographic areas...' choose
       'All Places within United States and Puerto Rico'
11) Press Next


Counts of firms by NAICS code at the county level may be useful: https://www2.census.gov/geo/pdfs/education/cbp12gdbs.pdf

Tax Increment Finance Zones