
#A new heatmap or two, based on a different location.
====Implementing the '''Real Elbow Method'''====
I calculated the between- and within-cluster variances, as described below, using distances from the ST_Distance function on PostGIS geographies (i.e., geodesic distances that account for an ellipsoidal earth using reference system WGS 1984).

The output of the Python HCL clustering script has around 40 million observations (place-statecode, year, layer, cluster, startup), and some of the intermediate tables took several minutes to build. As the process should be O(n), it could accommodate data that is perhaps 100x to 1000x bigger, assuming a patient researcher. That would put an upper bound at around 40 billion observations, as the hardware and software that we are running this on are pretty close to the (current) frontier.

=====Fixing an issue=====
The within-cluster variance (and so the F-stat and variance explained) revealed an issue with the data that had to be fixed: the Python HCA script forces the decomposition of multitons into singletons at the end of its run! We want to stop the HCA when every distinct location is its own cluster, rather than artificially forcing startups that share a location into separate points. This issue likely doesn't affect the maximum R2 method, but does affect the heuristic method(s) that rely on layer indices.

====Fixing the layer index====
I checked the implementation of the % layer index again, and fixed a mistake in it.
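
For reference, the between- and within-cluster variance calculation mentioned under the Real Elbow Method can be sketched as follows. This is only an illustration under stated assumptions, not the code that was actually run: it uses the usual ANOVA-style definitions of the F-stat and variance explained, it works on planar (x, y) coordinates with squared Euclidean distance instead of ST_Distance on PostGIS geographies, and the function name <code>variance_decomposition</code> is hypothetical.

```python
from collections import defaultdict


def variance_decomposition(points, labels):
    """Return (wss, bss, r2, f_stat) for planar points grouped by cluster label.

    Assumption: points are (x, y) tuples in a planar coordinate system, so
    squared Euclidean distance stands in for a squared geodesic distance.
    """
    n = len(points)
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    k = len(groups)

    def centroid(pts):
        return (sum(x for x, _ in pts) / len(pts),
                sum(y for _, y in pts) / len(pts))

    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    grand = centroid(points)
    wss = 0.0  # within-cluster sum of squares
    bss = 0.0  # between-cluster sum of squares
    for pts in groups.values():
        c = centroid(pts)
        wss += sum(sq_dist(p, c) for p in pts)
        bss += len(pts) * sq_dist(c, grand)

    tss = wss + bss
    r2 = bss / tss if tss > 0 else float("nan")  # variance explained

    # The F-stat is undefined with a single cluster, with every point in its
    # own cluster (n == k), or when the within-cluster sum of squares is zero.
    if k > 1 and n > k and wss > 0:
        f_stat = (bss / (k - 1)) / (wss / (n - k))
    else:
        f_stat = float("nan")
    return wss, bss, r2, f_stat


# Two tight clusters of two points each, far apart from one another.
wss, bss, r2, f_stat = variance_decomposition(
    [(0, 0), (0, 2), (10, 0), (10, 2)], [0, 0, 1, 1]
)
```

In this toy case the within-cluster sum of squares is 4, the between-cluster sum of squares is 100, and the F-stat is 50. Note that the F-stat is left undefined once every point sits in its own cluster, because the denominator uses n - k degrees of freedom.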
