Urban Start-up Agglomeration and Venture Capital Investment

From edegan.com
Jump to navigation Jump to search
Academic Paper
Title Urban Start-up Agglomeration and Venture Capital Investment
Author Ed Egan
RAs Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim
Status In development
© edegan.com, 2016


New Work

Files

New files are in E:\projects\agglomeration

  • agglomeration.sql -- new main sql file uses vcdb3

Contents of the Old Code subdirectory:

  • Agglomeration.SQL -- This was the replacement code that was run when vcdb3 was made. We had to work through it to update the counts. Came from E:\projects\vcdb3\OriginalSQL\Agglomeration.SQL
  • Agglomeration.sql -- Old version that makes colevel data for python script. Came from E:\mcnair\Projects\Agglomeration
  • Analysis.sql -- Old version builds from HCL forward. Came from E:\mcnair\Projects\Agglomeration

See also:

CoLevelForCircles

Note: Make sure that the geocoding has been fixed first! See Restoring_vcdb3#Fix_the_Geocoding. Note that we are working with decimal degrees to six decimal places from Google Maps, which is equivalent to a staggering 11cm of accuracy at the equator. See http://wiki.gis.com/wiki/index.php/Decimal_degrees.

The build is as follows:

  • portcomaster 48001 (where hadgrowth=1 34761) and portcogeo (47715 where latitude and longitude NOT NULL 47715) -> colevelsimple 33869
  • colevelsimple 33869-> CoPoints (adds point geoms)
  • tigerplaces 29322 (has place geoms) -> tiger2 (adds areas)
  • tiger2,CoPoints ST_Intersects -> colevelbase 29442
  • colevelbase,years -> colevelblowout 219060, CoLevelSummary 35344, PlacesWithGT10Active 200
  • colevelblowout,PlacesWithGT10Active -> CoLevelForCircles 171170

HCA

Take CoLevelForCircles.txt and feed it into the HCA script in

E:\projects\hca\main.py

See also:

Take results.tsv and load it as the HCL table.

Layers and levels

See Agglomeration.sql for the following build:

  • hcl loaded from results.tsv 29751998
  • Determine if singleset, multiset or hullset (fully partitions data)
  • Make hclsingletons, hclmultitons and hclhulls (which contains both hulls and lines)
  • UNION them together to create hclmain (14875999), which contains geographies
  • Make hcllayer 163887: Aggregate to the layer level calculating nosingleton, nomultiton, etc., as well as tothullarea etc.

Note that everything uses Geographies (except to find centroids), and pair lengths and hull areas are scaled so that they are in hm and hectares (hm2) [1], as this is close to being a city block length and square block [2] at least in Houston, TX (note that a block in Chicago is 2 Houston blocks and there is no standard block definition).

Then load up leveldefinitions.txt. Note that we are using more levels than before, with finer grained levels at the bottom end:

Level	Label	Target
1	100msq	0.01
2	1blfront	0.1
3	1blsq	1
4	5blsq	5
5	10blsq	10
6	25blsq	25
7	50sqbl	50
8	1kmsq	100
9	5kmsq	400
10	10kmsq	1000
11	20min rule	35000

Then build the nohulls as level 0 and the allhull as level 12. Note that we are going to have to throw out panels or observations with too few points per city-year later, as these can have singletons, multitons, or pairs as their allhulls. This can be done with nohull !=0. Also nohulls can be built using hcllayer then it will contain pairs, or from colevelforcircles. I opted for the later, so that it only contains singletons and multitons.

Finally in this part, build hcllevels and hcllayerwzero. For hcllevels we are going to compute mean distances between clusters. It is computational infeasible to do this for all layers. And then for all layers (inc zero) we are going to run our selection regression.

For the next steps on the data see Jeemin Sim (Work Log). This includes details of how to load the TIF data.

Approach

We want to choose some layers to work with. https://en.wikipedia.org/wiki/Hierarchical_clustering notes that "One can always decide to stop clustering when there is a sufficiently small number of clusters (number criterion). Some linkages may also guarantee that agglomeration occurs at a greater distance between clusters than the previous agglomeration, and then one can stop clustering when the clusters are too far apart to be merged (distance criterion)."

In a similar vein, https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set describes the elbow method, using AIC/BIC, etc., as well as an Information theoretic approach, the silhoette method, etc.

For us:

  1. Discarding outliers
  2. Elbow on fraction of locations in hulls
  3. Chosen by the researcher
  4. Maximum R-Squared

We also looked at:

  • Elbow on fraction of maximum hull area in hulls

And finally, we need to think about:

  • Reasonable exclusions

Discarding Outliers

We don't need to discard outliers, per se, just find a layer where outliers are singletons. A wrong approach is to take the highest layer with a single hull (or two hulls or three hulls, etc.). It is fair that if a layer never has a hull, then presumably it only has a single location or a line of locations (note that it is possible for a line to have more than 2 locations both because of multitons and because of perfect alignment, given our Google Maps accuracy), so we can discard it. However, this approach will find when there is just one hull left, rather than the last time that there is one hull in decomposition.

Possible options and issues:

  • Find when there are first two hulls and then step back a layer -- but there might never be two hulls, so if there is only ever one hull then find the max layer.
  • Form a chain from layer 1 on down that breaks when there is no longer just one hull. Perhaps count the number links, grouping where the chain has one hull or not, or require that the chain contain level 1... [3]

We went with the first option. The base table for this approach is hcllayer. The variables are highest1hulllayer, highest2hulllayer, and highest3hulllayer in the highesthulllayer table.

It is worth noting that the highest1hulllayer occurs on average at around 21.4% unclustered (with std dev. of 20.2%). These percentages go down alightly for highest2hulllayer and highest3hulllayer because cities that have 2 or 3 (or more) hulls have larger ecosystems and so more layers.

Elbow on fraction of locations in hulls

AgglomerationFLHGraph.png

The elbowcalc and elbowdata queries provide the data. elbowdata takes layer/finallayer (i.e., fraction unclustered, as the layer 1 is the all encompassing hull and final layer is the raw locations), rounds it to two digits, and then calculates the average fraction of locations in hulls and the average hull area fraction of all encompassing hull area. The former gives a nice curve with an elbow (found by taking the second derivative and setting it equal to zero) at x=0.40237.

We then identify the layer that is closest to having a fraction of locations in hulls of 0.40237, taking the lower level (i.e., the more clustered level) whenever there is a tie. The resulting indicator variable is called elbowflhlayer and is made in table Elbowflh. This is analyzed in a sheet in "Images Review.xlsx" in E:\projects\agglomeration.

Fraction of Maximum Hull Area in Hulls

AgglomerationFHAGraph.png

We also tried computing the fraction of the maximum hull area (MHA) in covered by hulls for each layer. The maximum hull area is on layer one, when every location is in an all-encompassing hull. We excluded data from layer one as well the final layer because they lead small data issues.

A cubic was a mediocre fit to this data, giving an R2 of 83% but with lots of deviation concentrated right around the local minimum ({-0.0224722, {x -> 0.446655}} [4], point of inflection and local maximum. A quartic had an R2 of 90% at around x=0.44 (6.408 x^4 - 15.176 x^3 + 12.592 x^2 - 4.3046 x + 0.517≈0.00825284 at x≈0.440275). I tried a quintic and it had inflection points are x=0.33, 0.55, and 0.82, as well as local maxima at 0.39 and 0.90. Visually there seems to be something going on in the 20% to 40% uncovered range too, perhaps a bifurcation of results, which might be due to rounding issues.

Reasonable Exclusions

We started by including all U.S. cities that received at least $10m of growth venture capital in a year between 1980 and 2017 (inclusive). This gave us a list of 200 cities. However, we still have a lot of city-years with low number of startups.

What is a reasonable number of startups to analyze agglomeration? Three locations (which is at least three startups) is the bare minimum required for one hull without excluding outliers. And we only made images for places with 4 or more startups. A visual inspection suggests that while there is greater (relative) dispersion when counts are low, it isn't hugely problematic. It is also worth noting that excluding 4 or less would get rid of Farmer's Branch, Fort Lauderdale, and Tempe (and Bloomington, MN) in 2017, and 6 or less in 2017 would eliminate Cary and Addison, all of which are slightly problematic. Burlington, VT has 7 years in the data with more than 6 startups, and one with 6.

But everywhere (i.e., all 200 places) have 10 or more layers at some point in time. And everywhere has at least 6 years with 6 or more observations. Detroit has just 7 obs that meet this criteria, half the number of Germantown, MD and a third of Greenwood Village, CO.! Requiring a year to have six observations would reduce us to 4916 observations from 6702 (i.e., down to 73% of the data). Requiring 9 would reduce the data down to 3889 obs (58%), and we'd lose more observations as places wouldn't have enough to form a time-series. The answer then appears to be to limit to observations with 6 or more layers. We'll code the number of layers, and the max and min number of layers for a place, into the data.

Maximum R-squared

Portland3HullsHighest.png

Using a maximum R-squared approach to find the 'best layer' for a city is inherently problematic. A city might have 5 layers in 1980 and 80 layers in 2017, and so using layer 40, say, irrespective of year is somewhat meaningless. There are several alternative that make more sense. One is to use the fraction unclustered, much like with the elbow approach. The other is to find the layer with a certain hull count (or as close to it as possible). Hulls might tend to be somewhat stable over time, so three hulls in Portland in 2017 will be centered in more or less the same place as three hulls in Portland in 2003. This turns out to be somewhat true, as seen in the image on the right, which uses the last time (highest level) that there are three hulls, or two for 1998 and 1993 (one of which is out of frame). One issue with this approach is that the highest level with a certain hull count is that hulls almost always contain just three points.

Portland3HullsLowest.png

An obvious alternative approach is to use the first time (lowest level) that there are three hulls. There is a big difference in the layer numbers for this. See the queries below. Essentially, the HCA algorithm often takes an original hull apart and then takes the resulting hulls apart, giving a quadratic for hull count again layer. But, as is apparent in the image, this leads to big areas that would only be good for overlap analysis and not for identifying individual clusters.


--Lowest,highest, and lowest-highest, and first-after-peak level where there are three hulls
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2015' ORDER BY year,layer;
--3, 41, 37, 33 (peak at 22,23,24,25)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2010' ORDER BY year,layer;
--3, 24, 24, 24 (peak at 18)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2005' ORDER BY year,layer;
--7, 23, 21, 21(peak at 18)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2000' ORDER BY year,layer;
--3, 17, 17 , 17 (peak at 12,13,14)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1995' ORDER BY year,layer;
--1(1), 8(1), 8(1), 8(1) (no peak-just flat but still take the highest layer with the nearest lower number of hulls)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1990' ORDER BY year,layer;
--2(2), 5(2), 5(2), 5(2) (peak at 2,3,4,5 so 5 is first 'after' peak? but still take the highest layer with the nearest lower number of hulls)
SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1985' ORDER BY year,layer;
--1(1), 1(1), 1(1), 1(1) (peak at 1 as that's the only layer with a hull but still take the highest layer with the nearest lower number of hulls) 

right|300px Two further options are to find the lowest-highest and the first-after-peak. These often coincide. The lowest-highest finds the highest layer with x hulls and then works back down the layers taking the lowest in the continuous chain of x hulled layers. The first-after-peak finds the first layer with x hulls in or after the layers where there is the peak number of hulls. This last approach is a little problematic because sometimes there isn't a peak - its just flat - and it is inconceivable that there could be two or more peaks.

Image Analysis

Building Images

Use B&W:

  • 50% grey at 75% transparent for city outline
  • 50% grey at 50% transparent for hull
  • Black + for singleton (size 10pt), 50% transparent when in pair or hull
  • Black * for multiton (size 10pt), 50% transparent when in pair or hull
  • Black line for pair

Informs colors:

  • Orange: R240 G118 B34
  • Blue: R31 G61 B124
  • Green: R129 G190 B65

Town: Blue, 75% tranparent

Working with ArcPy

First version saved as E:\projects\agglomeration\Test.mxd

If the basemaps aren't available, connect to ERSI online using the icon in the system tray[5]

Basic set up is:

  • displayhulls layer=1 grey50%,trans50%,noborder
  • displaymultitons asterisk4,black,size10,trans50%
  • displaysingletons cross1,black,size10,trans50%
  • placetigerarea grey50%,trans75%,grey80%border
  • Reference
  • Basemap - World Light Grey Canvas

Change dataframe map to GCS 1984 and display to decimal degrees Saved as: FullHullReview2017.mxd

Open python window then:

#Load the map and create the dataframe
mxd = arcpy.mapping.MapDocument(r"E:\projects\agglomeration\FullHullReview2017Colored.mxd")
df = arcpy.mapping.ListDataFrames(mxd)[0]
df.credits="(c) Ed Egan, 2019"
 
#Now do the image generation!
myDict = {} 
#myDict["Burlington_VT"] = [-73.27691399999999,-73.176383,44.445927999999995,44.539927] 
#... See the entries in E:\projects\agglomeration\arcpydict.txt

for location in myDict:
  newExtent = df.extent
  newExtent.XMin = myDict[location][0]
  newExtent.XMax = myDict[location][1]
  newExtent.YMin = myDict[location][2]
  newExtent.YMax = myDict[location][3]
  df.extent = newExtent
  df.panToExtent(df.extent)
  filename="E:\\projects\\agglomeration\\Images\\"+location +".png"
  arcpy.mapping.ExportToPNG(mxd, filename, resolution=144)

#arcpy.RefreshActiveView()
#mxd.save()

If you run into issues, it's useful to test things step by step:

#Test some exports, note that if the geocords are in a different system from the extent parameters, you'll be exporting blank images!
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto1.png", resolution=144)
print df.extent
#-73.5977988105381 44.1185146554607 -72.8022011894619 44.680787974202 NaN NaN NaN NaN

#Test with Burlington, VT
newExtent = df.extent
newExtent.XMin =-73.219086
newExtent.XMax =-73.19356
newExtent.YMin = 44.460346
newExtent.YMax = 44.48325
df.extent = newExtent
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto2.png", resolution=144)
df.panToExtent(df.extent)
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto3.png", resolution=144)

#Test with Buffalo, NY (while looking at Burlington, VT)
newExtent = df.extent
newExtent.XMin =-78.95687699999999
newExtent.XMax =-78.795157
newExtent.YMin =42.826023
newExtent.YMax =42.966454999999996
df.extent = newExtent
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BuffaloAuto1.png", resolution=144)
df.panToExtent(df.extent)
arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BuffaloAuto2.png", resolution=144)

Remove basemap credits[6]:

  • Click on World map layer
  • Insert->Dynamic Text->Service Level Credits
  • Set the symbol color to no color

Help pages:

Analyzing the results

The following issues became apparent (Counts out of 191 cities with 4 or more locations in 2017 and greater than $10m inv in a year over all time):

  1. Encapsulation - A small number of place boundaries are fully encapsulated inside of other geoplaces. We need to determine when this happens. The initial list includes Addison, Culver City, Santa Monica (might be extreme adjacency), and others. We need a query to work this out.
  2. Concavity (6 marked) - Some place boundaries are fairly extremely concave (for instance, Fort Lauderdale, FL, Birmingham, AL, Boulder, CO). This in itself isn't too much of an (addressable) issue. However, a small number of places have concavity and adjacency issues, which together lead to hull overlaps. This is ameriorated by removing outliers, but we should check them (e.g., Cary, NC, Morrisville, NC, the city next to Newark, CA, Roswell, GA)
  3. Adjacency (23 marked) - The entire of the valley has an adjacency issue (these weren't marked), as do a fairly large number of other cities. See Newport Beach, CA and others. Lexington, MA provides a nice example of containment despite adjacency. As does Cambridge, MA with the right outliers removed.
  4. Outliers (52 marked) - perhaps as many as 1 in 5 cities had one or two obvious outliers on a visual inspection.

Critical checks:

  • Addison, TX: encapsulation
  • Culver City, CA: encapsulation
  • Oklahoma City, OK: scale issue (one outlier in State House?)
  • Portland, ME: scale issue. Though Portland's place boundary contains an island and some sea area, making it very wonky, this isn't an issue.
  • San Juan Capistrano, CA: Just 2 locations (1 singleton and 1 multiton) and no hull. Note that we might want to omit this place.

We might also want to check Twin Cities. Here's the results:

place	statecode	Issue	Reason
Champaign	IL	No	Urbana isn't in the data
Phoenix	AZ	No	Mesa isn't in the data
San Francisco	CA	No	Twinned with Oakland!
Oakland	CA	No	Twinned with SF!
Stamford	CT	Yes	Norwalk 
Norwalk	CT	Yes	Stamford
New Haven	CT	No	Bridgeport isn't in the data
Tampa	FL	No	St. Petersburg
Portland	ME	No	South Portland isn't in the data
Minneapolis	MN	Yes	St. Paul
Bloomington	MN	No	Normal isn't in the data
Durham	NC	Yes?	Raleigh
Raleigh	NC	Yes?	Durham
Portland	OR	No	Vancouver isn't in the data
Bethlehem	PA	No	Allentown isn't in the data
Dallas	TX	Yes?	Fort Worth
Fort Worth	TX	Yes?	Dallas
Seattle	WA	No	Tacoma isn't in the data

A visual inspection suggests that Stamford and Norwalk might be better combined but don't really matter. Minneapolis and St. Paul are pretty separate and really separate after removing outliers. Rarleigh and Durham are completely separate (Cary is more of an issue), as are Dallas and Fort Worth and SF and Oakland.

Encapsulation

The data suggests that there are 12 places that encapsulated by 7 other places:

SELECT A.place, A.statecode, B.place AS ContainedPlace, B.statecode AS ContainedStatecode
       FROM placetigerarea AS A
        JOIN placetigerarea AS B ON st_contains(ST_ConvexHull(A.placegeog::geometry),ST_ConvexHull(B.placegeog::geometry))
        WHERE NOT (A.place=B.place AND A.statecode=B.statecode);
--12
place	statecode	containedplace	containedstatecode
Los Angeles	CA	Culver City	CA
Los Angeles	CA	Torrance	CA
Los Angeles	CA	El Segundo	CA
Los Angeles	CA	Santa Monica	CA
San Jose	CA	Santa Clara	CA
Fremont	CA	Newark	CA
Oakland	CA	Emeryville	CA
Cary	NC	Morrisville	NC
New York	NY	Jersey City	NJ
Dallas	TX	Richardson	TX
Dallas	TX	Addison	TX
Dallas	TX	Farmers Branch	TX

We could ignore, flag or discard these cites. A visual inspection suggests that Culver City, Torrence, El Segundo, Jersey City, and probably Richardson, Newark, and maybe Cary don't have any issues. Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison do look like they have issues, but with the exception of Farmer's Branch and Addison, these are big cites and with lots of locations, so the issue should be washed out by removing outliers or otherwise appropriately choosing the clustering layer.

Intersecting All Encompassing Hulls

52 places have all encompasing hulls intersect in our data (i.e., there are 26 intersections). This includes some of the places that suffer from encapsulation (especially Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison). So beyond encapsulated places, there are an additional 20 intersections. These are:

place	statecode	intersectedplace	intersectedstatecode
Alpharetta	GA	Roswell	GA
Bellevue	WA	Redmond	WA
Boston	MA	Cambridge	MA
Boston	MA	Somerville	MA
Campbell	CA	San Jose	CA
Centennial	CO	Greenwood Village	CO
Cupertino	CA	San Jose	CA
Fremont	CA	Newark	CA
Greenwood Village	CO	Centennial	CO
Irvine	CA	Newport Beach	CA
Los Altos	CA	Mountain View	CA
Menlo Park	CA	Redwood City	CA
Milpitas	CA	San Jose	CA
Mountain View	CA	Palo Alto	CA
Mountain View	CA	Sunnyvale	CA
Newton	MA	Wellesley	MA
Phoenix	AZ	Tempe	AZ
Redwood City	CA	San Carlos	CA
San Jose	CA	Sunnyvale	CA
Santa Clara	CA	Sunnyvale	CA

At a glance, most of these appear big or very big startup ecosystems. Accordingly, any process that deals with outliers (etc.) should address this issue.

First Estimation(s)

At this stage we have MasterLevels.txt and MasterLayers.txt as datafiles. MasterLevels.txt contains only layers corresponding to levels 0 through 12 and also has noothergeoms and avgdisthm as variables.

The questions we need to answer are: 1) Is there an agglomeration effect? 2) Which level or layer best describes a city (perhaps for a year, or perhaps over its life)?

We can just pick a level (say 25 hectares) and run a within-city regression:

. xtreg growthinv17l_f growthinv17l nosingletonl totmultitoncountl totpaircountl tothullcountl avgpairlengthl avghulldensi
> tyl avgdisthml i.year if level==6, fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,027
Group variable: placelevelid                    Number of groups  =        198

R-sq:                                           Obs per group:
     within  = 0.4097                                         min =          3
     between = 0.8310                                         avg =       25.4
     overall = 0.5974                                         max =         37

                                                F(44,197)         =      78.20
corr(u_i, Xb)  = 0.4087                         Prob > F          =     0.0000

                              (Std. Err. adjusted for 198 clusters in placelevelid)
-----------------------------------------------------------------------------------
                  |               Robust
   growthinv17l_f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
     growthinv17l |   .1388644   .0176074     7.89   0.000     .1041412    .1735877
     nosingletonl |   .1447935   .0402488     3.60   0.000     .0654197    .2241673
totmultitoncountl |   .0909545   .0481349     1.89   0.060    -.0039714    .1858803
    totpaircountl |   .1724367   .0383185     4.50   0.000     .0968695    .2480039
    tothullcountl |   .7120504   .0467915    15.22   0.000     .6197739    .8043269
   avgpairlengthl |  -.0219417    .023633    -0.93   0.354    -.0685478    .0246645
  avghulldensityl |    .049566   .0202756     2.44   0.015     .0095808    .0895511
       avgdisthml |   .0933327    .076309     1.22   0.223    -.0571546    .2438201

Or:

. xtreg growthinv17l_f growthinv17l numstartups numstartupssq nosinglemulti nosinglemultisq nohull nohullsq nopair nopairs
> q i.year if level==6, fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,773
Group variable: placelevelid                    Number of groups  =        200

R-sq:                                           Obs per group:
     within  = 0.4017                                         min =          4
     between = 0.8425                                         avg =       28.9
     overall = 0.5708                                         max =         37

                                                F(45,199)         =      72.39
corr(u_i, Xb)  = 0.4222                         Prob > F          =     0.0000

                            (Std. Err. adjusted for 200 clusters in placelevelid)
---------------------------------------------------------------------------------
                |               Robust
 growthinv17l_f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
   growthinv17l |    .220333    .018699    11.78   0.000     .1834595    .2572066
    numstartups |   .0062875   .0022944     2.74   0.007      .001763    .0108119
  numstartupssq |  -8.04e-07   1.14e-06    -0.70   0.483    -3.06e-06    1.45e-06
  nosinglemulti |   .0648575   .0168134     3.86   0.000     .0317023    .0980127
nosinglemultisq |  -.0021614   .0006336    -3.41   0.001    -.0034108    -.000912
         nohull |   .1747691   .0255105     6.85   0.000     .1244636    .2250747
       nohullsq |  -.0057148   .0012164    -4.70   0.000    -.0081136    -.003316
         nopair |   .0896908   .0248207     3.61   0.000     .0407455    .1386361
       nopairsq |  -.0097196   .0024153    -4.02   0.000    -.0144825   -.0049567


Note that the following don't work, either alone or with other variables (including numstartups and numstartupsq), probably because they are third-order effects:

. xtreg growthinv17l_f growthinv17l avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.
> year if level==6, fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,027
Group variable: placelevelid                    Number of groups  =        198

R-sq:                                           Obs per group:
     within  = 0.3579                                         min =          3
     between = 0.5926                                         avg =       25.4
     overall = 0.3753                                         max =         37

                                                F(43,197)         =    2152.49
corr(u_i, Xb)  = 0.2529                         Prob > F          =     0.0000

                             (Std. Err. adjusted for 198 clusters in placelevelid)
----------------------------------------------------------------------------------
                 |               Robust
  growthinv17l_f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
    growthinv17l |   .2668427   .0208574    12.79   0.000     .2257101    .3079752
  avghulldensity |   .0008076   .0003875     2.08   0.038     .0000433    .0015718
avghulldensitysq |  -1.14e-07   8.80e-08    -1.29   0.197    -2.87e-07    5.97e-08
   avgpairlength |  -.0018724   .0036128    -0.52   0.605    -.0089972    .0052524
 avgpairlengthsq |  -.0000296   .0000596    -0.50   0.620    -.0001471    .0000879
       avgdisthm |    .001429   .0035371     0.40   0.687    -.0055465    .0084045
     avgdisthmsq |   -.000012   .0000157    -0.76   0.447    -.0000429     .000019

We can also do it with fractions and their squares (omit fracsinglemulti). However at level 6 (25 hectare), pairs seems more important than hulls:

. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year if level==6, 
> fe cluster(placelevelid)

Fixed-effects (within) regression               Number of obs     =      5,773
Group variable: placelevelid                    Number of groups  =        200

R-sq:                                           Obs per group:
     within  = 0.3919                                         min =          4
     between = 0.8456                                         avg =       28.9
     overall = 0.5274                                         max =         37

                                                F(43,199)         =      62.34
corr(u_i, Xb)  = 0.4268                         Prob > F          =     0.0000

                          (Std. Err. adjusted for 200 clusters in placelevelid)
-------------------------------------------------------------------------------
              |               Robust
growthinv17~f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
 growthinv17l |   .2481436   .0181447    13.68   0.000     .2123631     .283924
  numstartups |   .0100673   .0019792     5.09   0.000     .0061644    .0139702
numstartupssq |  -6.92e-06   2.03e-06    -3.41   0.001    -.0000109   -2.92e-06
     fracpair |   .7540177   .3709212     2.03   0.043     .0225772    1.485458
   fracpairsq |     -1.936   .7030942    -2.75   0.006    -3.322472   -.5495289
     frachull |   .1969853    .562807     0.35   0.727    -.9128457    1.306816
   frachullsq |  -.1491389   .3878513    -0.38   0.701    -.9139649     .615687

Whereas across all levels:

. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year, fe cluster(p
> lacelevelid)

Fixed-effects (within) regression               Number of obs     =     76,623
Group variable: placelevelid                    Number of groups  =      2,600

R-sq:                                           Obs per group:
     within  = 0.3956                                         min =          4
     between = 0.8330                                         avg =       29.5
     overall = 0.5279                                         max =         37

                                                F(43,2599)        =     827.33
corr(u_i, Xb)  = 0.4143                         Prob > F          =     0.0000

                        (Std. Err. adjusted for 2,600 clusters in placelevelid)
-------------------------------------------------------------------------------
              |               Robust
growthinv17~f |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
 growthinv17l |   .2522524   .0049725    50.73   0.000     .2425019    .2620028
  numstartups |   .0100677   .0005323    18.91   0.000     .0090239    .0111114
numstartupssq |  -6.95e-06   5.51e-07   -12.62   0.000    -8.03e-06   -5.87e-06
     fracpair |   .4152028   .0956859     4.34   0.000     .2275745    .6028311
   fracpairsq |  -.9753654   .1271631    -7.67   0.000    -1.224717   -.7260141
     frachull |  -.8606939   .1231519    -6.99   0.000     -1.10218   -.6192081
   frachullsq |    .495785   .0976557     5.08   0.000     .3042942    .6872758

This is probably because of the variation in hulls vs pairs at level 6, which has lots of cities with nothing in pairs and everything in hulls. We might want to 'control' for cityarea by restricting our within city analysis to large enough cities. A 25 hectare target area might be too encapsulating -- more than 10% of observations are 100% in hulls:

. su frachull if level==6, det

                          frachull
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .2162162       .1153846
 5%     .3333333       .1153846
10%     .4285714       .1428571       Obs               6,032
25%           .6       .1428571       Sum of Wgt.       6,032

50%           .8                      Mean           .7539126
                        Largest       Std. Dev.      .2209328
75%     .9666666              1
90%            1              1       Variance       .0488113
95%            1              1       Skewness      -.6390206
99%            1              1       Kurtosis       2.400947

I tried using R2 to select levels, but only the second spec had an interior solution (at level 3):

forvalues i=1/12 {
	quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.year if level==`i', fe cluster(placelevelid)
	display "Reg: 1   level: " `i' "   r2-within: " `e(r2_w)'
}
forvalues i=1/12 {
	quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq i.year if level==`i', fe cluster(placelevelid)
	display "Reg: 2   level: " `i' "   r2-within: " `e(r2_w)'
}
forvalues i=1/12 {
	quietly capture xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracsinglemulti fracsinglemultisq fracpair fracpairsq frachull frachullsq i.year if level==`i', fe cluster(placelevelid)
	display "Reg: 3   level: " `i' "   r2-within: " `e(r2_w)'
}
Reg: 2   level: 1   r2-within: .39552569
Reg: 2   level: 2   r2-within: .3998779
Reg: 2   level: 3   r2-within: .40348691
Reg: 2   level: 4   r2-within: .40130097
Reg: 2   level: 5   r2-within: .39931203
Reg: 2   level: 6   r2-within: .39707046
Reg: 2   level: 7   r2-within: .39366909
Reg: 2   level: 8   r2-within: .38957831
Reg: 2   level: 9   r2-within: .38398108
Reg: 2   level: 10   r2-within: .37662604
Reg: 2   level: 11   r2-within: .36999057
Reg: 2   level: 12   r2-within: .38393843

However, doing this by exgroup (0 t0 3), gives the same result - level 3 - for each exgroup.

Alternative approaches are to use AIC/BIC, or maybe entropy. For the same set of variables, in the same model, AIC/BIC are minimized when R2 is maximized, they are only useful when choosing the combination of the variables/estimation and the level. And it seems we can only do entropy one variable at a time:

entropyetc nohull if level==1

Other

See also:

Old Work Using Circles

Very Old Summary

Agglomeration is generally thought to be one of the most important determinants of growth for urban entrepreneurship ecosystems. However, there is essentially no empirical evidence to support this. This paper takes advantage of geocoding and introduces a novel measure of agglomeration. This measure is the smallest circle area that covers all startup offices, subject to having at least N startups in each circle. Using GIS data on cities, this paper controls for the density and socio-demographics of an area to identify the effect of just agglomeration.

Description

Clusters of economic activity plays a significant role in the firms performance and growth. An important driver of growth is the knowledge spillover between firms. This includes among others the facilitation of information flow and ideas between firms which could be a milestone especially in the growth of startup firms or small businesses. This project focuses on the effects of agglomeration on the performance and growth of startup firms. It introduces a novel measure of agglomeration which can be used to empirically test the effects of clustering. This measure the is smallest total circle area that covers all of the startups in the sample such that there are at least n firms in each circle. The projects is based on the creation of an algorithm which gives an unbiased measure to be used in the empirical analysis. The regression we are interested in takes the following form:

Regression equation.png

The dependent variable is a measure of growth of the firms. This measure could be investment forwarded one period or growth in investment. The control variables include the number of the startups firms, m, the agglomeration measure, A and a vector of other control variables affecting the growth of firms at time t. Because of the endogeneity in the circle area or the measure of agglomeration, A, there is a need for an instrumental variable to get consistent estimates of the effects we are interested in. The proposed instrument is the presence of a river, or road in between the points representing geographical locations of the venture capital backed up firms. The instrument affects agglomeration without having a direct impact on the growth. This makes it good candidate for a valid instrument. The next tasks are determining the additional control variables to include in the regression, years to include in the analysis and methods of finding an unbiased measure of agglomeration.

Data

Making the circle input data

Ed's additional datawork is in

Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessingCoLevelSimple.sql

The key table for circle processing is CoLevelBlowout, which is restricted (to include cities with greater than 10 active at some point in the data) to make CoLevelForCircles.

We need to:

  1. Winsorize CoLevelBlowout
  2. Compute the circles!
  3. Make the Bay Area (over time) data
  4. Plot the Bay Area data (with colors per Bay Area city) for 1985 to present
  5. Combine the plots to make an animated gif

To winsorize the data we need the formula for Great Circle Distance. The radius of the earth is 6,378km (half of diameter: 12,756 km). So:

GCD = acos( sin(lat1) x sin(lat2) + cos(lat1) x cos(lat2) x cos(long1-long2) ) x r

Main Sources

The primary sources of data for this project are:

  • SDC VentureXpert - from VC Database Rebuild, the key table is
  • GIS City Data
  • Data on NSF, NIH, population, income, clinical trials, employment, schooling, R&D expenditures and revenue of firms can be found in Hubs.

VC data

Data on the number of new vc backed firms in each city and year is in:

Z:\Hubs\2017\clean data
The name of the file is firm_nr.txt.

Database is cities SQL script is: nr_firms.sql

Raw data is in:

Z:\VentureCapitalData\SDCVCData\vcdb2
The file is colevelsimple.txt

In order to see if there are outliers, I get the average coordinates for all cities and find the differences of the firm's coordinates from the city coordinate. The script for the average city coordinates is in

Z:\Hubs\2017\sql scripts and the file name is newcolevel.sql.

The differences are taken in excel. The file containing the differences is in

Z:\Hubs\2017 and the file name is new_colevel.txt.
  • Data on the circle area in each city and year is in:
Z:\Hubs\2017\clean data
The name of the file is circles.txt. (It contains only 106 observations)

Database is cities SQL script is: circles.sql

The script for joining the two tables on the VC table is in:

Z:\Hubs\2017\sql scripts
 The name of the file is new_firm_nr_circles.sql
  • We use the cities with greater than 10 active VC backed firms. Data on the cities and number of active firms is in:
E:\McNair\Projects\Hubs\Summer 2017
The file is CitiesWithGT10Active.txt

The script for joining the final data with this file is located in

Z:\Hubs\2017\sql scripts
The file name is final_joined_kerda.sql.

The final data is in

Z:\Hubs\2017\clean data
The file name is new_final_kerda.txt.

Accelerator data

Accelerators data is in

Z:\Hubs\2017\clean data
The file name is accelerators.txt
The table is accelerators

The joined accelerators data with the VC table is in joined_accelerators table. The script is in

Z:\Hubs\2017\sql scripts
The file name is join_accelerators.sql

The do file is in

Z:\Hubs\2017\kerda
The name is agglomeartion_kerda.do

It includes the graphs, tables and the preliminary FE regressions with VC funding amount and growth rate. It also predicts the hazard rates, matches on the hazard rate in order to create synthetic control and treatment groups. What is left to do is to add 2 lagged and 3 forward observations for the cities which do have a match. Remove the overlapping observations for the years that get a treatment but which at the same time serve as a control.

See also

Also:


Unbiased measure

The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, which is not affected by the number of the startups, given the size and their distribution.

For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf

GIS Resources

Useful functions for spatial joins

sum(expression): aggregate to return a sum for a set of records
count(expression): aggregate to return the size of a set of records
ST_Area(geometry) returns the area of the polygons
ST_AsText(geometry) returns WKT text
ST_Buffer(geometry, distance): For geometry: Returns a geometry that represents all points whose distance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference System of this Geometry. For geography: Uses a planar transform wrapper.
ST_Contains(geometry A, geometry B) returns the true if geometry A contains geometry B
ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
ST_DWithin(geometry A, geometry B, radius) returns the true if geometry A is radius distance or less from geometry B
ST_GeomFromText(text) returns geometry
ST_Intersection(geometry A, geometry B): Returns a geometry that represents the shared portion of geomA and geomB. The geography implementation does a transform to geometry to do the intersection and then transform back to WGS84
ST_Intersects(geometry A, geometry B) returns the true if geometry A intersects geometry B
ST_Length(linestring) returns the length of the linestring
ST_Touches(geometry A, geometry B) returns the true if the boundary of geometry A touches geometry B
ST_Within(geometry A, geometry B) returns the true if geometry A is within geometry B
geometry_a && geometry_b: Returns TRUE if A’s bounding box overlaps B’s.
geometry_a = geometry_b: Returns TRUE if A’s bounding box is the same as B’s.
ST_SetSRID(geometry, srid): Sets the SRID on a geometry to a particular integer value.
ST_SRID(geometry): Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys table.
ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID referenced by the integer parameter.
ST_Union(): Returns a geometry that represents the point set union of the Geometries.
substring(string [from int] [for int]): PostgreSQL string function to extract substring matching SQL regular expression.
ST_Relate(geometry A, geometry B): Returns a text string representing the DE9IM relationship between the geometries.
ST_GeoHash(geometry A): Returns a text string representing the GeoHash of the bounds of the object.

Native functions for geography

ST_AsText(geography) returns text
ST_GeographyFromText(text) returns geography
ST_AsBinary(geography) returns bytea
ST_GeogFromWKB(bytea) returns geography
ST_AsSVG(geography) returns text
ST_AsGML(geography) returns text
ST_AsKML(geography) returns text
ST_AsGeoJson(geography) returns text
ST_Distance(geography, geography) returns double
ST_DWithin(geography, geography, float8) returns boolean
ST_Area(geography) returns double
ST_Length(geography) returns double
ST_Covers(geography, geography) returns boolean
ST_CoveredBy(geography, geography) returns boolean
ST_Intersects(geography, geography) returns boolean
ST_Buffer(geography, float8) returns geography [1]
ST_Intersection(geography, geography) returns geography [1]

Functions for Linear Referencing

ST_LineInterpolatePoint(geometry A, double measure): Returns a point interpolated along a line.
ST_LineLocatePoint(geometry A, geometry B): Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
ST_Line_Substring(geometry A, double from, double to): Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
ST_Locate_Along_Measure(geometry A, double measure): Return a derived geometry collection value with elements that match the specified measure.
ST_Locate_Between_Measures(geometry A, double from, double to): Return a derived geometry collection value with elements that match the specified range of measures inclusively.
ST_AddMeasure(geometry A, double from, double to): Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.

3-D Functions

ST_3DClosestPoint — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
ST_3DDistance — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
ST_3DDWithin — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
ST_3DDFullyWithin — Returns true if all of the 3D geometries are within the specified distance of one another.
ST_3DIntersects — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
ST_3DLongestLine — Returns the 3-dimensional longest line between two geometries
ST_3DMaxDistance — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
ST_3DShortestLine — Returns the 3-dimensional shortest line between two geometries

Relevant PostgreSQL Commands

\dt *.* Show all tables
\q Exit table

Specifities/ Outliers to consider

New York (decompose)
Princeton area (keep Princeton  unique)
Reston, Virginia (keep)
San Diego (include La Jolla)
Silicon Valley (all distinct)

To make a circle

SELECT ST_Buffer([desired point], [desired radius], 'quad_segs=8') 
FROM [desired table]

quad_segs=8 indicates circle

CirclePostGIS.png

For more precision in circle:

SELECT ST_Transform(geometry( 
    ST_Buffer(geography( 
        ST_Transform( [desired point], 4326 )), 
            [desired radius]')), 
            900913) FROM [desired table]

4326 and 900913 represent particular precision.

Decimal Degrees

We are working with longitude and latitude in decimal degrees. See https://en.wikipedia.org/wiki/Decimal_degrees

When converting radius to km, multiply by 111.3199. For area, multiple by (111.3199)^2=12,392.12013601.

Census Data

Population

The Census Gazetteer files for 2010, 2000 and 1990 can give use population by census place. See https://www.census.gov/geo/maps-data/data/gazetteer.html

The places file contains data for all incorporated places and census designated places (CDPs) in the 50 states, the District of Columbia and Puerto Rico as of the January 1, 2010. The file is tab-delimited text, one line per record. Some records contain special characters.
Download the National Places Gazetteer Files (1.2MB)
Download the State-Based Places Gazetteer Files:
Column	Label	Description
Column 1	USPS	United States Postal Service State Abbreviation
Column 2	GEOID	Geographic Identifier - fully concatenated geographic code (State FIPS and Place FIPS)
Column 3	ANSICODE	American National Standards Insititute code
Column 4	NAME	Name
Column 5	LSAD	Legal/Statistical area descriptor.
Column 6	FUNCSTAT	Functional status of entity.
Column 7	POP10	2010 Census population count.
Column 8	HU10	2010 Census housing unit count.
Column 9	ALAND	Land Area (square meters) - Created for statistical purposes only.
Column 10	AWATER	Water Area (square meters) - Created for statistical purposes only.
Column 11	ALAND_SQMI	Land Area (square miles) - Created for statistical purposes only.
Column 12	AWATER_SQMI	Water Area (square miles) - Created for statistical purposes only.
Column 13	INTPTLAT	Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
Column 14	INTPTLONG	Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively.

Relationships

See https://www.census.gov/geo/maps-data/data/relationship.html

These text files describe geographic relationships. There are two types of relationship files; those that show the relationship between the same type of geography over time (comparability) and those that show the relationship between two types of geography for the same time period.

ACS (American Community Survey) Data

Steps to download:

1) Go to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml
2) Select 'I know the dataset or table(s) that I want to download.'
3) Press Next
4) For 'Select a program:' choose
       'American Community Survey'
5) For 'Select a dataset and click Add to Your Selections:' choose
       '<YEAR OF INTEREST> ACS 1-year estimates'
6) Press 'Add To Your Selections'
7) Press Next
8) For 'Select a geographic type:' choose
       'Place - 160'
9) For Select a state:
       Don't choose a state, as we wish to download all.
10) For 'Select one or more geographic areas...' choose
       'All Places within United States and Puerto Rico'
11) Press Next

Other

Counts of firms by NAICS code at the county level may be useful: https://www2.census.gov/geo/pdfs/education/cbp12gdbs.pdf

Tax Increment Finance Zones