Changes

Jump to navigation Jump to search
#I've rechecked the code and I now think it is computationally feasible. What I was trying to do before was find the average distance between every set of coordinates, which is an order more complex than what we need to do to calculate even between-group variance (within and total are simpler). Think O(n) rather than O(n^2), and we have around ~20million statistical clusters spread over ~200k layers. Estimate, given (1) above: 2hrs.
=====Non-trivial The "Right" Implementation=====
In our contextI could implement the elbow method using scalar distances. Specifically. I could measure <math>\bar{Y}_{i\cdot} - \bar{Y}</math> as the distance between a cluster's centroid and the overall centroid, and measure <math>Y_{ij}-\bar{Y}_{i\cdot}</math> as the distance between a cluster's constituent location and its centroid. There's a sense in which this approach is "the right thing to do", and the distance measurements are pretty straight-forward in PostGIS (and would account for an elipsoid earth).  However, in actuality, we have a vectors of locations <math>(a,b)</math> of locations, and not a scalar for each pointdistances as fundamental inputs. This changes the math[https://online.stat.psu.edu/stat505/lesson/8/8.2], as well as the ultimate 'estimator'test statistic [https://online.stat.psu.edu/stat505/lesson/8/8.3] that we might use.
Specifically, <math>Y_{ij}</math> is a vector: <math>\mathbf{Y_{ij}} = \begin{pmatrix} Y_{ija} \\ Y_{ijb}\end{pmatrix}</math>, as are the sample mean <math>\mathbf{Y_{i\cdot}} = \begin{pmatrix} \bar{Y}_{i\cdot a}\\ \bar{Y}_{i\cdot b} \end{pmatrix}</math> and the grand mean <math>\mathbf{Y} =
The '''T''' sum of squares and cross products is <math>\mathbf{T=H+E}</math>.
This is all potentially problematic because relational databases don't naturally do linear algebra very well. And we likely don't want to do this in some other software because for two reasons. First, we'd have to move the data out and back into the database, which is costly. And, and second the other software isn't likely to handle an elipsoid earth and the coordinates are not actually on a flat plane. (Though, assuming a flat earth might be reasonable as most cities are sufficiently small that curvature won't materially affect results, no matter how conceptually offensive it is... )
We could calculate the elements of '''H''' and '''E''' separatelyin PostGIS, taking advantage of ST_Distance. That wouldn't be too bad as they the matrices are only 2x2s. However, there might it's worth asking whether that would be a shortcut available"right".
Let's look at how '''H''' captures information about 'distance'(ignoring flat earth issues). Denote <math>\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}} = \begin{pmatrix} \Delta A \\ \Delta B\end{pmatrix}</math>. Then:
:<math>
</math>
Thus the trace is the square of the distance between points (for H, between a point within a cluster and the cluster's centroid), which is super easy to calculate without carrying around all the elements of the matrix. The trace of a product is the product of the traces only under specific circumstances and not in general, though we likely meet those circumstances (no zero elements on the diagonal, or no points exactly on the centroids).
=====Standard Estimators=====

Navigation menu