*[[Restoring vcdb3]]
===CoLevelForCircles===
Note: Make sure that the geocoding has been fixed first! See [[Restoring_vcdb3#Fix_the_Geocoding]]. Note that we are working with decimal degrees to six decimal places from Google Maps, which is equivalent to a staggering 11cm of accuracy at the equator. See http://wiki.gis.com/wiki/index.php/Decimal_degrees.
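The 11cm figure can be checked with a few lines of arithmetic. The sketch below assumes a spherical Earth with an equatorial circumference of about 40,075 km; the function name and constant are illustrative and not part of the project code.

```python
import math

# Assumption: approximate equatorial circumference of the Earth in metres.
EQUATOR_CIRCUMFERENCE_M = 40_075_017


def decimal_degree_resolution_m(decimal_places, latitude_deg=0.0):
    """Return (north_south, east_west) ground distance in metres that
    corresponds to the last retained decimal place of a coordinate."""
    metres_per_degree = EQUATOR_CIRCUMFERENCE_M / 360.0
    step_deg = 10 ** -decimal_places
    north_south = metres_per_degree * step_deg
    # East-west distance shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_m, ew_m = decimal_degree_resolution_m(6)
print(f"six decimal places at the equator: about {ns_m * 100:.1f} cm")
```

Away from the equator the east-west resolution is finer still, so six decimal places are more than enough for startup office locations.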
*colevelblowout,PlacesWithGT10Active -> '''CoLevelForCircles 171170'''
===HCA===
Take CoLevelForCircles.txt and feed it into the HCA script in
Take results.tsv and load it as the HCL table.
===Layers and levels===
See Agglomeration.sql for the following build:
For the next steps on the data see [[Jeemin Sim (Work Log)]]. This includes details of how to load the TIF data.
===Approach===
We want to choose some layers to work with. https://en.wikipedia.org/wiki/Hierarchical_clustering notes that "One can always decide to stop clustering when there is a sufficiently small number of clusters '''(number criterion)'''. Some linkages may also guarantee that agglomeration occurs at a greater distance between clusters than the previous agglomeration, and then one can stop clustering when the clusters are too far apart to be merged '''(distance criterion)'''."
*Reasonable exclusions
====Discarding Outliers====
We don't need to discard outliers, per se, just find a layer where outliers are singletons. A wrong approach is to take the highest layer with a single hull (or two hulls or three hulls, etc.). It is fair that if a layer never has a hull, then presumably it only has a single location or a line of locations (note that it is possible for a line to have more than 2 locations both because of multitons and because of perfect alignment, given our Google Maps accuracy), so we can discard it. However, this approach will find when there is just one hull left, rather than the last time that there is one hull in decomposition.
It is worth noting that the highest1hulllayer occurs on average at around 21.4% unclustered (with a std. dev. of 20.2%). These percentages go down slightly for highest2hulllayer and highest3hulllayer because cities that have 2 or 3 (or more) hulls have larger ecosystems and so more layers.
====Elbow on fraction of locations in hulls====
[[File:AgglomerationFLHGraph.png|right|300px]] The '''elbowcalc''' and '''elbowdata''' queries provide the data. '''elbowdata''' takes layer/finallayer (i.e., fraction unclustered, as layer 1 is the all-encompassing hull and the final layer is the raw locations), rounds it to two digits, and then calculates the average fraction of locations in hulls and the average hull area as a fraction of the all-encompassing hull area. The former gives a nice curve with an elbow (found by taking the second derivative and setting it equal to zero) at x=0.40237.
We then identify the layer that is closest to having a fraction of locations in hulls of 0.40237, taking the lower level (i.e., the more clustered level) whenever there is a tie. The resulting indicator variable is called '''elbowflhlayer''' and is made in table '''Elbowflh'''. This is analyzed in a sheet in "Images Review.xlsx" in E:\projects\agglomeration.
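The selection rule can be sketched as follows. This is a minimal illustration, not the query that builds '''Elbowflh''': the function name and the dictionary input (layer number mapped to fraction of locations in hulls) are assumptions made for the example.

```python
# Elbow value reported above for the fraction of locations in hulls.
ELBOW_FLH = 0.40237


def pick_elbow_layer(flh_by_layer, target=ELBOW_FLH):
    """Return the layer whose fraction of locations in hulls is closest
    to the target. Ties go to the lower (more clustered) layer because
    the layer number is the second element of the sort key."""
    return min(
        flh_by_layer,
        key=lambda layer: (abs(flh_by_layer[layer] - target), layer),
    )


# Hypothetical city-year: layer 1 is the all-encompassing hull.
example = {1: 1.0, 2: 0.7, 3: 0.41, 4: 0.2, 5: 0.0}
print(pick_elbow_layer(example))
```

Putting the layer number into the sort key keeps the tie-break explicit rather than relying on the order in which layers happen to be stored.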
====Fraction of Maximum Hull Area in Hulls====
[[File:AgglomerationFHAGraph.png|right|300px]] We also tried computing the fraction of the maximum hull area (MHA) covered by hulls for each layer. The maximum hull area is on layer one, when every location is in an all-encompassing hull. We excluded data from layer one as well as the final layer because they lead to small-data issues.
A cubic was a mediocre fit to this data, giving an R2 of 83% but with lots of deviation concentrated right around the local minimum ({-0.0224722, {x -> 0.446655}} [https://www.wolframalpha.com/input/?i=minimum+-2.3595x%5E3+%2B+4.3803x%5E2+-+2.5008x+%2B+0.4309]), point of inflection, and local maximum. A quartic had an R2 of 90%, with an inflection point at around x=0.44 (6.408 x^4 - 15.176 x^3 + 12.592 x^2 - 4.3046 x + 0.517≈0.00825284 at x≈0.440275). I tried a quintic and it had inflection points at x=0.33, 0.55, and 0.82, as well as local maxima at 0.39 and 0.90. Visually there seems to be something going on in the 20% to 40% uncovered range too, perhaps a bifurcation of results, which might be due to rounding issues.
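The quoted points can be reproduced from the fitted coefficients without Wolfram Alpha. The sketch below uses only the cubic and quartic coefficients given above; the helper names are illustrative. For the cubic it solves f'(x)=0 for the stationary points, and for the quartic it solves f''(x)=0, which is where the quoted x≈0.440275 falls.

```python
import math

# Coefficients from the fits above, highest power first.
CUBIC = (-2.3595, 4.3803, -2.5008, 0.4309)
QUARTIC = (6.408, -15.176, 12.592, -4.3046, 0.517)


def polyval(coeffs, x):
    """Evaluate a polynomial with Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def derivative(coeffs):
    """Return the coefficients of the derivative polynomial."""
    degree = len(coeffs) - 1
    return tuple(c * (degree - i) for i, c in enumerate(coeffs[:-1]))


def real_quadratic_roots(a, b, c):
    """Return the real roots of a*x^2 + b*x + c in ascending order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# Cubic: stationary points are the roots of the first derivative.
cubic_d1 = derivative(CUBIC)
cubic_d2 = derivative(cubic_d1)
stationary = real_quadratic_roots(*cubic_d1)
cubic_min_x = next(x for x in stationary if polyval(cubic_d2, x) > 0)
cubic_max_x = next(x for x in stationary if polyval(cubic_d2, x) < 0)
cubic_inflection_x = -cubic_d2[1] / cubic_d2[0]

# Quartic: inflection points are the roots of the second derivative.
quartic_inflections = real_quadratic_roots(*derivative(derivative(QUARTIC)))

print("cubic local minimum:", cubic_min_x, polyval(CUBIC, cubic_min_x))
print("cubic inflection and local maximum:", cubic_inflection_x, cubic_max_x)
print("quartic inflection points:", quartic_inflections)
```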
====Reasonable Exclusions====
We started by including all U.S. cities that received at least $10m of growth venture capital in a year between 1980 and 2017 (inclusive). This gave us a list of 200 cities. However, we still have a lot of city-years with a low number of startups.
But everywhere (i.e., all 200 places) has 10 or more layers at some point in time. And everywhere has at least 6 years with 6 or more observations. Detroit has just 7 obs that meet this criterion, half the number of Germantown, MD and a third of Greenwood Village, CO! Requiring a year to have six observations would reduce us to 4916 observations from 6702 (i.e., down to 73% of the data). Requiring 9 would reduce the data down to 3889 obs (58%), and we'd lose more observations as places wouldn't have enough to form a time-series. The answer then appears to be to limit to observations with 6 or more layers. We'll code the number of layers, and the max and min number of layers for a place, into the data.
====Maximum R-squared====
[[File:Portland3HullsHighest.png|right|300px]] Using a maximum R-squared approach to find the 'best layer' for a city is inherently problematic. A city might have 5 layers in 1980 and 80 layers in 2017, and so using layer 40, say, irrespective of year is somewhat meaningless. There are several alternatives that make more sense. One is to use the fraction unclustered, much like with the elbow approach. The other is to find the layer with a certain hull count (or as close to it as possible). Hulls might tend to be somewhat stable over time, so three hulls in Portland in 2017 will be centered in more or less the same place as three hulls in Portland in 2003. This turns out to be somewhat true, as seen in the image on the right, which uses the last time (highest level) that there are three hulls, or two for 1998 and 1993 (one of which is out of frame). One issue with using the highest level with a certain hull count is that the hulls almost always contain just three points.
quietly capture reg growthinv17lf pc1 pc2 pc3 if placeid==`placeid' & numclusters==`clusters' & lowesthighestflag==1 & year>=1995 & year <.
=====Issues and Solution=====
There are two issues. Why are we using a PCA? Just to get the number of regressors down? The dimensionality isn't that high. And more importantly, one or more PCA components may be picking up a scale effect. We don't want to use the scale regressors in R2 estimation, because they might drive the R2.
*The large set results look more interior and stable than before... the cutoff of 12 looks reasonable too.
=====Revisiting Portland=====
[[File:Portland4HullsLowestHighest.png|right|300px]] Portland doesn't have 4 clusters for any year before 2000, or for 2007 and 2009. For 5 year multiples the layers are as follows:
The resulting map has much more adjacency than overlap. Measuring the nearest hull edge and center distance for each hull in a year to each hull in the next year and averaging would compute two measures of hull persistence. The overlap area from year to year, either in total or as a fraction of the second year's (or the smaller year's) total area, would provide another measure of persistence.
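The center-distance measure could be sketched as below. This is only an illustration under simplifying assumptions: hulls are given as lists of (x, y) vertices in a planar coordinate system, each hull has non-zero area, and the function names are hypothetical. The edge-distance and overlap-area measures would need a geometry library (or PostGIS) and are not shown.

```python
import math


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid is the accumulated sum divided by 6 * area = 3 * twice_area.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def mean_nearest_centroid_distance(hulls_this_year, hulls_next_year):
    """For each hull this year, find the nearest hull centroid next year,
    then average those distances across this year's hulls."""
    next_centroids = [polygon_centroid(h) for h in hulls_next_year]
    distances = []
    for hull in hulls_this_year:
        cx, cy = polygon_centroid(hull)
        distances.append(
            min(math.hypot(cx - nx, cy - ny) for nx, ny in next_centroids)
        )
    return sum(distances) / len(distances)


# Hypothetical hulls: one unit square, then the same square moved 3 units.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
moved = [(3, 0), (4, 0), (4, 1), (3, 1)]
print(mean_nearest_centroid_distance([square], [moved]))
```

A smaller average indicates that hulls stay in roughly the same place from one year to the next.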
=====What do we want to know?=====
So now we have 200 (ish) cities with their optimally selected hulls (we chose the best hull count that is constant from 1995 to 2017 using the lowest-highest occurrence of that count). And now we'd like to know:
*Whether having hulls closer together is associated with growth, controlling for size. '''We should put these layers in the list to build avghulldisthm and avgdisthm (see line 1335).'''
=====Houston, TX=====
We also want to know about Houston, TX.
Supposing that all of Houston's 43 active startups were relocated, it doesn't much matter where you put them. One question is whether the 4 sq mile innovation corridor would then be an improvement over the status quo, and how much worse it would be than a district of the implied optimum density. Such a district, using 0.95 hectares per location, would have an implied hull area of around 45 hectares, or a 0.003011 decimal degrees deviation in four directions from a point (to give 4 corners of a square).
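The conversion from a hull area to a deviation in decimal degrees can be sketched as follows. It assumes a square district and a flat conversion of roughly 111,320 metres per degree (the approximate length of a degree of latitude); both the constant and the function name are assumptions for illustration, and the calculation ignores the shortening of longitude degrees at Houston's latitude.

```python
import math

# Assumption: approximate length of one degree of latitude in metres.
METRES_PER_DEGREE = 111_320.0


def square_half_side_degrees(area_hectares, metres_per_degree=METRES_PER_DEGREE):
    """Half the side length, in decimal degrees, of a square with the
    given area. Moving this far in each direction from a centre point
    gives the four corners of the square."""
    area_m2 = area_hectares * 10_000.0  # 1 hectare = 10,000 square metres
    half_side_m = math.sqrt(area_m2) / 2.0
    return half_side_m / metres_per_degree


print(round(square_half_side_degrees(45), 6))
```

For 45 hectares this lands within a few millionths of a degree of the 0.003011 figure quoted above; the small gap depends on the metres-per-degree constant used.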
====Group Means Regression====
Once we have found optimum hull specifications within a city, they will not vary, or will vary very little, over time. We therefore want to use a between panel regression, also called a group means regression. See the following:
*The definition in https://www.stata.com/manuals13/xtxtreg.pdf
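As a rough illustration of what a between (group means) regression does, the sketch below averages each variable within a place and then runs ordinary least squares on the place-level means. It handles a single regressor only, and the function name and input layout are assumptions for the example rather than the project's Stata code.

```python
from collections import defaultdict


def between_estimator(rows):
    """rows: iterable of (group, x, y) observations.
    Returns (intercept, slope) from OLS of mean(y) on mean(x) by group."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for group, x, y in rows:
        entry = sums[group]
        entry[0] += x
        entry[1] += y
        entry[2] += 1

    # One observation per group: the within-group means.
    means = [(sx / n, sy / n) for sx, sy, n in sums.values()]
    k = len(means)
    mean_x = sum(x for x, _ in means) / k
    mean_y = sum(y for _, y in means) / k

    sxx = sum((x - mean_x) ** 2 for x, _ in means)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in means)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical panel: three places, two years each.
panel = [
    ("A", 1.0, 2.0), ("A", 3.0, 8.0),
    ("B", 4.0, 9.0), ("B", 6.0, 13.0),
    ("C", 0.0, 1.0), ("C", 2.0, 5.0),
]
print(between_estimator(panel))
```

Because only the group means enter the regression, a hull specification that is constant within a city still contributes to the estimate, which is the reason for preferring this estimator here.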
===Image Analysis===
====Building Images====
Use B&W:
Town: Blue, 75% transparent
====Working with ArcPy====
First version saved as E:\projects\agglomeration\Test.mxd
*Data access module: http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-data-access/what-is-the-data-access-module-.htm
====Analyzing the results====
The following issues became apparent (Counts out of 191 cities with 4 or more locations in 2017 and greater than $10m inv in a year over all time):
A visual inspection suggests that Stamford and Norwalk might be better combined but don't really matter. Minneapolis and St. Paul are pretty separate and really separate after removing outliers. Raleigh and Durham are completely separate (Cary is more of an issue), as are Dallas and Fort Worth and SF and Oakland.
=====Encapsulation=====
The data suggests that there are 12 places that are encapsulated by 7 other places:
We could ignore, flag or discard these cities. A visual inspection suggests that Culver City, Torrance, El Segundo, Jersey City, and probably Richardson, Newark, and maybe Cary don't have any issues. Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison do look like they have issues, but with the exception of Farmer's Branch and Addison, these are big cities with lots of locations, so the issue should be washed out by removing outliers or otherwise appropriately choosing the clustering layer.
=====Intersecting All Encompassing Hulls=====
52 places have all-encompassing hulls that intersect in our data (i.e., there are 26 intersections). This includes some of the places that suffer from encapsulation (especially Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison). So beyond encapsulated places, there are an additional 20 intersections. These are:
At a glance, most of these appear to be big or very big startup ecosystems. Accordingly, any process that deals with outliers (etc.) should address this issue.
===First Estimation(s)===
'''Note that this subsection is now very out of date!'''
At this stage we have MasterLevels.txt and MasterLayers.txt as datafiles. MasterLevels.txt contains only layers corresponding to levels 0 through 12 and also has noothergeoms and avgdisthm as variables.
entropyetc nohull if level==1
==TIF work==
See:
*[[TIF Project]]
*[[Jeemin Sim (Work Log)]]
Install ogr2ogr on mother:
 apt install gdal-bin
===Chicago===
Starting the process with Chicago. Do the following:
*Go to: https://data.cityofchicago.org/Community-Economic-Development/Boundaries-Tax-Increment-Financing-Districts/fz5x-7zak
*Save the KML as E:\projects\agglomeration\TIF\Chicago.KML
*Load the KML into the dbase using ogr2ogr [https://gdal.org/programs/ogr2ogr.html]; note that the nln option creates a new layer (table)
*Note that you have to hit refresh on the dbase (or at least the table list) in DataGrip to get the new table to show up
 ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Chicago.kml -nln tifchicago
The table tifchicago corresponds to the XML below, except that it has some extra fields. The index field seems to be autogenerated, and tessellate, extrude and visibility are always -1, 0, and -1 respectively. All other fields are blank, including description, timestamp, begin, and end (note that you can query end with "end" in postgres):
 SELECT ogc_fid, name, description, timestamp, begin, "end", altitudemode, tessellate, extrude, visibility, draworder, icon FROM tifchicago;
The KML is an XML file with meta data (including an approval date, an expiration date, and a name, as well as some other values, some of which could be derived from the geometry) and then a set of points that describe the outer ring of a polygon:
 <Placemark>
   <styleUrl>#defaultStyle</styleUrl>
   <name>Northwest Industrial Corridor</name>
   <ExtendedData>
     <Data name="approval_d"><value>12/2/1998</value></Data>
     <Data name="comm_area"><value>19,20,23,25,26,27</value></Data>
     <Data name="expiration"><value>12/2/2021</value></Data>
     <Data name="ind"><value>Industrial</value></Data>
     <Data name="name_trim"><value>Northwest Industrial Corridor</value></Data>
     <Data name="objectid"><value>0</value></Data>
     <Data name="objectid_1"><value>107</value></Data>
     <Data name="ref"><value>T- 64</value></Data>
     <Data name="repealed_d"><value></value></Data>
     <Data name="sbif"><value>Y</value></Data>
     <Data name="shape_area"><value>51402231.0322</value></Data>
     <Data name="shape_leng"><value>80417.4828932</value></Data>
     <Data name="show"><value>1</value></Data>
     <Data name="type"><value>Existing</value></Data>
     <Data name="use"><value>Industrial</value></Data>
     <Data name="wards"><value>27,28,30,31,37</value></Data>
   </ExtendedData>
   <MultiGeometry>
     <Polygon>
       <outerBoundaryIs>
         <LinearRing>
           <coordinates>
             -87.74541914577178,41.92534327389125
             -87.74541599224814,41.92516215074396
           </coordinates>
         </LinearRing>
       </outerBoundaryIs>
     </Polygon>
   </MultiGeometry>
 </Placemark>
==Other==
See also:
*[[The Impact of Entrepreneurship Hubs on Urban Venture Capital Investment]]
*[[TIF Project]]
=Old Work Using Circles=
