Changes

Jump to navigation Jump to search
#I've rechecked the code and I now think it is computationally feasible. What I was trying to do before was find the average distance between every set of coordinates, which is an order more complex than what we need to do to calculate even between-group variance (within and total are simpler). Think O(n) rather than O(n^2), and we have around ~20million statistical clusters spread over ~200k layers. Estimate, given (1) above: 2hrs.
=====Harder than it looksNon-trivial Implementation=====
In our context, we have a vectors <math>(a,b)</math> of locations, not a scalar for each point. This changes the math[https://online.stat.psu.edu/stat505/lesson/8/8.2], as well as the ultimate 'estimator'[https://online.stat.psu.edu/stat505/lesson/8/8.3] that we might use.
Specifically, <math>Y_{ij}</math> is a vector: <math>\mathbf{Y_{ij}} = \begin{pmatrix} Y_{ija} \\ Y_{ijb}\end{pmatrix}</math>, as are the sample mean <math>\mathbf{Y_{i\cdot}} = \begin{pmatrix} \bar{Y}_{i\cdot a}\\ \bar{Y}_{i\cdot b} \end{pmatrix}</math> and the grand mean <math>\mathbf{Y} =
\begin{pmatrix} \bar{Y}_{a}\\ \bar{Y}_{b} \end{pmatrix}</math>.
 
So the between-variance, which is called the '''H'''ypothesis sum of squares and cross products, is a 2x2 matrix and has K-1 degrees of freedom:
 
:<math>
H=\sum_{i=1}^{K} n_i(\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}})(\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}})'
</math>
 
And the within-variacnce, called the '''E'''rror sum of squares and cross products, is a 2x2 matrix and has N-K degrees of freedom:
 
:<math>
E=\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} ( \mathbf{Y_{ij}}-\mathbf{\bar{Y}_{i\cdot}} )( \mathbf{Y_{ij}}-\mathbf{\bar{Y}_{i\cdot}} )'
</math>
 
The '''T''' sum of squares and cross products is <math>\mathbf{T=H+E}</math>.
 
This is all potentially problematic because relational databases don't naturally do linear algebra very well. And we likely don't want to do this in some other software because we'd have to move the data out and back into the database, which is costly, and the other software isn't likely to handle an elipsoid earth and the coordinates are not on a flat plane. Though, assuming a flat earth might be reasonable as most cities are sufficiently small that curvature won't materially affect results...
 
We could calculate the elements of '''H''' and '''E''' separately. That wouldn't be too bad as they are only 2x2s. However, there might be a shortcut available.
 
Let's look at how '''H''' captures information about 'distance'. Denote <math>\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}} = \begin{pmatrix} \Delta A \\ \Delta B\end{pmatrix}</math>. Then:
 
:<math>
\mathbf{H}=\sum_{i=1}^{K} n_i \left ( \begin{pmatrix} \Delta A^2 & \Delta A \Delta B \\ \Delta B^2 & \Delta A \Delta B\end{pmatrix} \right)
</math>
 
So, the determinant and trace of '''H''' are:
 
:<math>
\det(\mathbf{H}) = \Delta A^2 \Delta B^2 - (\Delta A \Delta B)^2 \quad \mbox{and} \quad Tr(\mathbf{H}) = \Delta A^2 + \Delta B^2
</math>
 
Thus the trace is the square of the distance between points (for H, between a point within a cluster and the cluster's centroid), which is super easy to calculate without carrying around all the elements of the matrix. The trace of a product is the product of the traces only under specific circumstances and not in general, though we likely meet those circumstances (no zero elements on the diagonal, or no points exactly on the centroids).
 
=====Standard Estimators=====
 
We'd like to use some measure of variance explained, but variance is now a matrix. So, instead, we should probably turn to a standard estimator for MANOVA, which is the direct equivalent. These are:
Wilk's Lambda: <math>\Lambda^* = \frac{\det \mathbf{E}}{\det(\mathbf{H}+\mathbf{E})}</math>
 
Hotelling-Lawley Trace: <math>T = Tr(HE^{-1})</math>
 
Pillai Trace: <math>V = Tr(H(H+E)^{-1})</math>
 
So, we if we calculate scalar distances...
====The Heuristic Method Justification====

Navigation menu