Multivariate Methods:
Ordination and kmeans
Session 11: R-Peer-Group
K-means clustering
- kmeans is a simple method to identify clusters within data
- kmeans optimizes the partitioning of \( n \) individuals into \( k \) clusters
- Clusters are defined by a multi-dimensional mean (centroid)
- Individuals are assigned to clusters based on their proximity to
a cluster's centroid
- The within-group sum of squared distances from each centroid is minimized
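The quantity being minimized can be written as \( W = \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2 \), where \( C_j \) is the set of individuals in cluster \( j \) and \( \mu_j \) is its centroid. As a rough check, the sketch below computes \( W \) by hand on simulated data and compares it with what kmeans reports. The data and object names (x, fit, wss_by_hand) are made up for illustration.

```r
# Simulated data: two well-separated groups in two dimensions (illustrative only)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

fit <- kmeans(x, centers = 2, nstart = 10)

# Within-group sum of squares computed by hand:
# squared Euclidean distance from each individual to its own cluster centroid
assigned_centroid <- fit$centers[fit$cluster, , drop = FALSE]
wss_by_hand <- sum((x - assigned_centroid)^2)

# kmeans() reports the same quantity as tot.withinss
all.equal(wss_by_hand, fit$tot.withinss)
```

The per-cluster contributions are in fit$withinss, which is why summing that vector (as in the code later in these slides) gives the total.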
K-means in 2 dimensions
Randomly initialize centroids for \( k = 2 \)
Identify point closest to each cluster centroid
Define new centroids
Assign individuals to centroids
Reassign individuals to new centroids
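The steps above can be sketched in a few lines of R. This is a bare-bones version of the standard iterative scheme (often called Lloyd's algorithm), written for illustration only; it is not what kmeans() runs internally by default. The function name simple_kmeans and the simulated points are hypothetical.

```r
# Minimal k-means sketch; simple_kmeans is a hypothetical helper, not part of R
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. randomly pick k individuals as the starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2. squared distance from every individual to every centroid (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    # 3. assign each individual to its closest centroid
    new_cluster <- max.col(-d2, ties.method = "first")
    # stop once no individual changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # 4. move each centroid to the mean of the individuals assigned to it
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centroids[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Usage on simulated 2-dimensional data with two clearly separated groups
set.seed(42)
pts <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))
res <- simple_kmeans(pts, k = 2)
table(res$cluster)
```

Because the starting centroids are random, the result can depend on the seed when groups overlap; kmeans() offers the nstart argument to try several starts and keep the best.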
K-means in 3 dimensions
Plot with centroids defined
Plot data over centroids
Test for the number of clusters (\( k \))
# mydata is a data frame
# test for k = 1:15
set.seed(10)
# total within-groups sum of squares for each candidate k
wsskk <- sapply(1:15, function(i) {
  return(sum(kmeans(mydata, centers = i, iter.max = 10000)$withinss))
})
plot(1:15, wsskk, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
# mark k = 5 on the plot
abline(v = 5, col = "red")
An example using real data
In this example we will use a data set collected by the United States Census
Bureau. The data describe demographic changes in 51 states (presumably the 50
states plus the District of Columbia) between 2000 and 2001. An analyst is
charged with identifying meaningful structure within the data to allow state
legislators to understand how their policies might be affecting their populations.
First we test for the most likely value of \( k \)
# Read US census data
us_data <- read.delim("us_data.txt", header = TRUE)
set.seed(962520851)
# total within-groups sum of squares for k = 1:20 (first column is excluded)
usSS <- sapply(1:20, function(i) {
  return(sum(kmeans(us_data[, -1], centers = i, iter.max = 10000)$withinss))
})
plot(1:20, usSS, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
Next we run kmeans for the chosen \( k \)
usK <- kmeans(us_data[, -1], centers = 5, iter.max = 10000)
Now we need to find some meaningful way to visualise the results
# Try plotting a map
library("maps")     # provides the state polygons used by map_data()
library("ggplot2")
# named vector of cluster assignments, keyed by upper-case state name
clust <- usK$cluster
names(clust) <- toupper(us_data[, 1])
# load US state polygon data (one row per polygon vertex)
states <- map_data("state")
# look up the cluster of the state each vertex belongs to
clust_assign <- unname(clust[toupper(states$region)])
# plot all states with ggplot, shading each state by its cluster
p <- ggplot()
p <- p + geom_polygon(data = states, aes(x = long, y = lat, group = group),
                      colour = "white", fill = gray(1/clust_assign))
p
References
- Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR), 31(3), 264-323.
- Crawley, M. J. (2012). The R Book. Wiley.
- Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer.

Ordination analyses
- Ordination attempts to describe multidimensional variance in a
lower-dimensional space
- These lower dimensions become the axes of the plot and should summarise
the relationships between variables (sites, species or explanatory variables)
in multivariate space
- These techniques are based, explicitly or implicitly, on a distance measure
between samples/sites (e.g. Euclidean distance, chi-squared distance,
Bray-Curtis distance)
- There are many techniques and metrics available; today I will focus on
Principal Components Analysis (PCA, based on Euclidean distance) and
correspondence analysis (CA, based on chi-squared distance, suitable for
species data)
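As a small illustration of the PCA case, the sketch below uses prcomp() from base R on the built-in iris measurements. The data set is an arbitrary choice made here, not one used in the session. It shows how much variance each axis captures and checks that the distances between samples in the full principal component space equal the Euclidean distances in the original variables.

```r
# PCA on the four numeric measurements in the built-in iris data (illustrative choice)
x <- iris[, 1:4]
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# proportion of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# sample scores on the first two axes: the coordinates used in an ordination plot
scores <- pca$x[, 1:2]
plot(scores, col = as.integer(iris$Species), xlab = "PC1", ylab = "PC2")

# PCA preserves Euclidean distance: distances between samples computed from
# all principal components equal the distances in the original variables
all.equal(as.vector(dist(x)), as.vector(dist(pca$x)))
```

Plotting only the first two components discards the remaining variance, so var_explained is worth checking before interpreting the plot.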
Constrained ordination analyses
- Ordination space is constrained by explanatory variables
(i.e. axes of the new "lower-dimensional space" are now combinations of
predictor variables such as chemistry, habitat, invasion state etc.)
- P-values are permutation based
- Here I will discuss canonical correspondence analysis (CCA), an extension
of CA. Other commonly used approaches include RDA (an extension of PCA)
and PERMANOVA (function adonis in package "vegan", which uses a Bray-Curtis
metric by default)
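A minimal sketch of a CCA with a permutation test, assuming the vegan package is installed. The example data sets varespec and varechem ship with vegan; the three predictors (Al, P, K) and the number of permutations are arbitrary choices made here for illustration.

```r
# Sketch only: runs if the vegan package is available
if (requireNamespace("vegan", quietly = TRUE)) {
  data(varespec, package = "vegan")   # species cover (sites x species)
  data(varechem, package = "vegan")   # soil chemistry for the same sites

  # CCA: correspondence analysis constrained by the chosen predictors
  mod <- vegan::cca(varespec ~ Al + P + K, data = varechem)
  print(mod)

  # permutation test of the overall constrained model
  set.seed(1)
  print(anova(mod, permutations = 999))
}
```

The printed model separates constrained inertia (explained by the predictors) from unconstrained inertia, and the p-value comes from the permutations rather than from a parametric distribution.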
References
- Borcard, D., Gillet, F., & Legendre, P. (2011). Numerical Ecology with R. Springer.
- Oksanen, J. (2011). Multivariate Analysis of Ecological Communities in R: vegan
tutorial (available online as a PDF)