Changes

We could normalize the number of clusters, dividing it by its maximum, to deal with the 'cities are different' problem. That is, we could put %unclustered (later called %complete) on the x-axis and %variance explained on the y-axis and fit a curve to a plot of city-year-layers. We could then pick a single %unclustered value and apply it across cities. The difference between this and the 'heuristic method' is that we'd be choosing based on diminishing marginal returns in variance explained rather than in the percentage of locations in hulls.
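As a rough illustration of that selection step, here is a minimal Python sketch, assuming a hypothetical DataFrame with columns <code>pct_complete</code> and <code>pct_var_explained</code> holding one row per city-year-layer; the cubic fit and the 50% slope cutoff are placeholder choices, not part of the proposal.

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd

def pick_pct_complete(df: pd.DataFrame, degree: int = 3) -> float:
    """Fit one pooled curve to the city-year-layer points and return the
    %complete value where marginal variance explained starts to flatten."""
    x = df["pct_complete"].to_numpy(dtype=float)
    y = df["pct_var_explained"].to_numpy(dtype=float)
    coeffs = np.polyfit(x, y, degree)             # pooled fit across all cities
    grid = np.linspace(x.min(), x.max(), 500)
    slopes = np.gradient(np.polyval(coeffs, grid), grid)  # marginal returns
    # Placeholder elbow rule: first point where the marginal gain in
    # variance explained falls below half of its initial value.
    flat = np.flatnonzero(slopes < 0.5 * slopes[0])
    return float(grid[flat[0]] if flat.size else grid[-1])
</syntaxhighlight>

Any saturating curve and cutoff rule would do here; the point is only that the threshold is chosen once, on the pooled plot, and then applied uniformly to every city.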
'''Addendum:'''
# We could do the elbow method on a per-city-year basis. The number of statistical clusters is equal to the number of layers, so we'd be indexing over layers and selecting a layer for each city-year. It might be worth trying this for some city-year, say Tulsa, 2003; the code would be reusable for a bigger sample. Estimate: 3 hrs.
# I've rechecked the code and I now think it is computationally feasible. What I was trying to do before was find the average distance between every pair of coordinates, which is an order of magnitude more complex than what we need to calculate even between-group variance (within-group and total variance are simpler still). Think O(n) rather than O(n^2), and we have around ~20 million statistical clusters spread over ~200k layers; see the sketch after this list. Estimate, given (1) above: 2 hrs.
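To make the complexity point concrete, here is a sketch of the variance decomposition computed in a single pass over the points, with no pairwise distances. The names <code>coords</code> and <code>labels</code> are illustrative, not from the project's codebase, and clusters are assumed to be labelled 0..k-1.

<syntaxhighlight lang="python">
import numpy as np

def variance_explained(coords: np.ndarray, labels: np.ndarray) -> float:
    """coords: (n, d) point coordinates; labels: (n,) integer cluster ids."""
    n, d = coords.shape
    k = labels.max() + 1
    grand_mean = coords.mean(axis=0)
    total_ss = ((coords - grand_mean) ** 2).sum()    # one pass: O(n)

    # Per-cluster sums and counts accumulated in one pass: O(n), not O(n^2).
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    np.add.at(sums, labels, coords)
    np.add.at(counts, labels, 1)
    means = sums / np.maximum(counts, 1)[:, None]

    # Between-group sum of squares from cluster means alone.
    between_ss = (counts[:, None] * (means - grand_mean) ** 2).sum()
    return between_ss / total_ss   # within_ss = total_ss - between_ss
</syntaxhighlight>

Variance explained is then just between_ss / total_ss, the quantity on the y-axis of the plot above, so the per-city-year elbow reduces to one such pass per layer.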
====The Heuristic Method Justification====
