===Heuristic Layer===
[[File:AgglomerationInflectionScatterPlotAllDataAgglomerationInflectionScatterPlotAllDataCircles.png|500px|right]] I had previously calculated the heuristic layer by taking the mean fracinhull (i.e., % of startups in economic clusters) for each percentage of the layer index (i.e., for 101 observations) and then fitting a cubic to those means. I did this because Excel can't handle fitting a cubic to the full data (i.e., all 148,556 city-year-layers). However, that approach is incorrect: I should have used the median fracinhull, and even that would have been slightly wrong because of orthogonality issues in calculating mean square distances. So I redid the plot using all the data, and fitted the cubic in STATA instead. See: '''inflection.do''' and '''inflection.log'''.
The old result is in [https://www.edegan.com/wiki/Urban_Start-up_Agglomeration_and_Venture_Capital_Investment#Fixing_an_issue Fixing an issue] below, and is x≈0.483879. The corrected result is x≈0.487717 (note that R2 has dropped to 92.43%).
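As a rough illustration of how an inflection point falls out of a fitted cubic, the sketch below fits y = a·x³ + b·x² + c·x + d by ordinary least squares and returns x* = −b/(3a), the point where the second derivative 6a·x + 2b is zero. This is not the contents of '''inflection.do'''; the function names (<code>fit_cubic</code>, <code>cubic_inflection</code>) and the synthetic data are assumptions made purely for illustration.

```python
def fit_cubic(xs, ys):
    """Least-squares fit of y = a*x^3 + b*x^2 + c*x + d via the normal equations.

    Hypothetical helper for illustration; a statistics package would normally
    do this step.
    """
    powers = [3, 2, 1, 0]
    # Normal equations (X'X) beta = X'y, with design columns x^3, x^2, x, 1.
    A = [[sum(x ** (p + q) for x in xs) for q in powers] for p in powers]
    rhs = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in powers]

    # Solve the 4x4 system by Gaussian elimination with partial pivoting.
    n = 4
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]

    # Back-substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * beta[k] for k in range(i + 1, n))
        beta[i] = s / M[i][i]
    return beta  # [a, b, c, d]


def cubic_inflection(xs, ys):
    """Inflection point of the fitted cubic: second derivative 6a*x + 2b = 0."""
    a, b, _c, _d = fit_cubic(xs, ys)
    return -b / (3.0 * a)


# Synthetic data only (NOT the city-year-layer dataset): an exact cubic
# constructed so that its inflection sits at x = 0.5.
xs = [i / 100.0 for i in range(101)]
ys = [2.0 * (x - 0.5) ** 3 + 0.3 * x + 0.2 for x in xs]
x_star = cubic_inflection(xs, ys)
```

Fitting to every observation, rather than to 101 pre-averaged points, weights each layer index by the number of city-year-layers observed there, which is one reason the two estimates can differ.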
I also calculated an '''inflectionlayer''' (as opposed to the heurflhlayer, where flh stands for fraction of locations in hulls, described above). This inflectionlayer is '''the first layer''' at which the second central difference in the '''share of startups in economic clusters''' switches sign. It can only be calculated when there are at least 4 data points, as each second central difference requires data from layer-1, layer and layer+1, and we need two of them to observe a sign change. The variable is included in the dataset (and so in the do files, etc.) from version 3-4 onward.
However, the inflectionlayer is really meaningless. The sign of the second central difference switches back and forth due to integer effects, and I can't find a straightforward algorithm to pick the "correct" candidate from the set of results. Picking the '''first one''', which is what I currently do, is completely arbitrary. There are a bunch of examples of the curves and the issue(s) in Results3-4.xlsx sheet 'Inflection'. I expect that if I put a bunch of time into this I could come up with some change thresholds to rule candidate answers in or out, but even then this isn't a good method.
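The sign-switch rule, and the way it produces several candidates on a noisy curve, can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used to build the dataset: the function name <code>sign_switch_layers</code>, the 1-indexed layer numbering, the choice to report the layer at which the new sign first appears, and the choice to skip exact zeros are all assumptions, and the input series are made up.

```python
def sign_switch_layers(shares):
    """Return every layer at which the second central difference changes sign.

    shares[i] is the share of startups in economic clusters at layer i + 1
    (1-indexed layers; an assumption for this sketch). Zero differences are
    skipped rather than treated as a sign (also an assumption).
    """
    if len(shares) < 4:
        # Two second central differences are needed to observe a sign change.
        return []

    candidates = []
    prev_sign = 0
    for i in range(1, len(shares) - 1):
        # Second central difference at interior index i.
        d2 = shares[i + 1] - 2 * shares[i] + shares[i - 1]
        sign = (d2 > 0) - (d2 < 0)
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            candidates.append(i + 1)  # report as a 1-indexed layer
        prev_sign = sign
    return candidates


# Made-up series in integer percent (integers avoid floating-point sign fuzz).
smooth = [0, 10, 30, 60, 80, 90, 95]      # one clean S-shaped curve
noisy = [0, 10, 30, 40, 60, 80, 85, 95]   # small wobbles of the kind integer effects create

smooth_candidates = sign_switch_layers(smooth)   # [4]
noisy_candidates = sign_switch_layers(noisy)     # [3, 4, 6, 7]

# Taking the first candidate, mirroring the "first one" rule described above.
first_noisy = noisy_candidates[0] if noisy_candidates else None
```

On the smooth series there is a single candidate, so "first" is unambiguous. On the noisy series the same rule yields four candidates, and nothing in the data singles out the first of them.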
Ultimately, the individual city-year curves (i.e., across layers within a city-year) are just way too noisy. A variant of this noise problem is what makes the elbow method so problematic, but the noise is even worse with the inflection method. Using the heuristic result above solves this noise problem by aggregating city-years together. One complaint made about the heuristic result is that it is near the middle (i.e., it's 48.7717%, which happens to be near 50%). Although the nature of any HCA on geographic coordinates implies that the result is unlikely to be close to the bounds (0 or 100%) and more likely to be near the middle (50%), it could be in an entirely different place. '''This result (i.e., the heuristic layer at 48.7717%) characterizes the agglomeration of venture-backed startup firms'''. You'd get a very different number if you studied gas stations, supermarkets, airports, or banana plantations!
{{Colored box|title=The Case for the Heuristic Method|content=The heuristic method (i.e., using the inflection in the plot from the population of city-year-layers) finds pretty much the same layer as the R2 method with almost no work, and it can be used in a within-city analysis without having to hold hull count constant.}}
