Changes

I also wanted to fix the confusion between CSAs (Combined Statistical Areas)[https://en.wikipedia.org/wiki/Combined_statistical_area] and CMSAs (Consolidated Metropolitan Statistical Areas)[https://www2.census.gov/geo/pdfs/reference/GARM/Ch13GARM.pdf]. CMSA redirects to CSA on Wikipedia. However, it is not clear that the two are the same thing. The OMB is the originator of both terms[https://www.census.gov/programs-surveys/metro-micro/about/masrp.html].
====Implementing The Elbow Method====

This section explores whether we could implement the '''actual''' elbow method (see https://en.wikipedia.org/wiki/Elbow_method_(clustering)). The elbow method plots the number of clusters (on x) against the percentage of variance explained (on y) and finds the elbow. The elbow is the point at which the "diminishing returns [in variance explained] are no longer worth the additional cost [of adding another cluster]". For the variance explained there are two main options:

#Variance explained = between-group variance / total variance
#Variance explained = between-group variance / within-group variance (note that this is the ANOVA F-statistic)

From Wikipedia:

:[https://en.wikipedia.org/wiki/F-test The F-test] in one-way analysis of variance is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. The "explained variance", or "between-group variability" is

::<math>\sum_{i=1}^{K} n_i(\bar{Y}_{i\cdot} - \bar{Y})^2/(K-1)</math>

:where <math>\bar{Y}_{i\cdot}</math> denotes the [[average|sample mean]] in the ''i''-th group, <math>n_i</math> is the number of observations in the ''i''-th group, <math>\bar{Y}</math> denotes the overall mean of the data, and <math>K</math> denotes the number of groups. The "unexplained variance", or "within-group variability" is

::<math>\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} \left( Y_{ij}-\bar{Y}_{i\cdot} \right)^2/(N-K),</math>

:where <math>Y_{ij}</math> is the ''j''<sup>th</sup> observation in the ''i''<sup>th</sup> out of <math>K</math> groups and <math>N</math> is the overall sample size. This ''F''-statistic follows the [[F-distribution|''F''-distribution]] with degrees of freedom <math>d_1=K-1</math> and <math>d_2=N-K</math> under the null hypothesis. The statistic will be large if the between-group variability is large relative to the within-group variability, which is unlikely to happen if the [[expected value|population means]] of the groups all have the same value.

====The Elbow Method Justification====
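As a reference point for the discussion, the two candidate definitions of variance explained given above can be computed side by side. The following is a minimal sketch, not an implementation used anywhere on this page: the function name <code>variance_ratios</code> is hypothetical, it handles a single scalar variable to match the quoted F-test formulas, and it assumes that "between-group variance / total variance" is measured as a ratio of sums of squares. For multi-dimensional data the squared differences would presumably be replaced by squared Euclidean distances to the group centroid.

```python
from collections import defaultdict


def variance_ratios(values, labels):
    """Return (share_explained, f_statistic) for scalar values grouped by labels.

    share_explained = between-group SS / total SS            (option 1, assumed form)
    f_statistic     = [between SS / (K-1)] / [within SS / (N-K)]   (option 2)
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")

    # Collect the observations belonging to each group (cluster).
    groups = defaultdict(list)
    for y, g in zip(values, labels):
        groups[g].append(float(y))

    n_total = len(values)  # N, the overall sample size
    k = len(groups)        # K, the number of groups
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    grand_mean = sum(float(y) for y in values) / n_total

    ss_between = 0.0  # sum over groups of n_i * (group mean - overall mean)^2
    ss_within = 0.0   # sum over observations of (Y_ij - group mean)^2
    for members in groups.values():
        mean_i = sum(members) / len(members)
        ss_between += len(members) * (mean_i - grand_mean) ** 2
        ss_within += sum((y - mean_i) ** 2 for y in members)

    if ss_within == 0:
        raise ValueError("within-group variability is zero; F is undefined")

    share_explained = ss_between / (ss_between + ss_within)
    f_statistic = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return share_explained, f_statistic


if __name__ == "__main__":
    # Two well-separated groups: most of the variance is between groups.
    share, f_stat = variance_ratios([1, 2, 3, 11, 12, 13], [0, 0, 0, 1, 1, 1])
    print(share, f_stat)
```

Evaluating both quantities for each candidate number of clusters gives the two possible y-axes for the elbow plot. Note that the first ratio is bounded by 1 and cannot decrease when a cluster is split, whereas the F-statistic is unbounded and penalizes additional clusters through its degrees of freedom, so the two curves need not suggest the same elbow.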
An attempt at a paragraph justifying the 'heuristic' method:
