Multivariate Methods:

Ordination and kmeans

Session 11: R-Peer-Group

K-mean clustering

K-means in 2 dimensions

randomly initialize for \( k = 2 \)

plot of chunk unnamed-chunk-1

Identify point closest to each cluster centroid

plot of chunk unnamed-chunk-2

Define new centroids

Assign individuals to centroids

plot of chunk unnamed-chunk-6

plot of chunk unnamed-chunk-7

Reassign individuals to new centroids

plot of chunk unnamed-chunk-8

plot of chunk unnamed-chunk-9

K-means in 3 dimensions

Plot with centroids defined

plot of chunk unnamed-chunk-10

Plot data over centroids

plot of chunk unnamed-chunk-11

Test for the number of clusters (\( k \))

# mydata is a dataframe test for k = 1:15
set.seed(10)
wsskk <- sapply(1:15, function(i) {
    return(sum(kmeans(mydata, centers = i, iter.max = 10000)$withinss))
})

plot(1:15, wsskk, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

abline(v = 5, col = "red")

plot of chunk unnamed-chunk-12

An example using real data

In this example we will use a data set collected by the United States Census
Bureau. The data describes demographic changes in 51 states between 2000 and
2001. An analyst is charged with identifying meaningful strucutre within the
data to allow state legislators to understand how their policies might be affecting their population.

First we test for the most likely value for \( k \)

# Read galaxy data
us_data <- read.delim("us_data.txt", header = TRUE)
set.seed(962520851)
usSS <- sapply(1:20, function(i) {
    return(sum(kmeans(us_data[, -1], centers = i, iter.max = 10000)$withinss))
})

plot(1:20, usSS, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

plot of chunk unnamed-chunk-13

Next we run kmeans for the choosen \( k \)

usK <- kmeans(us_data[, -1], centers = 5, iter.max = 10000)

Now we need to find some meaningful way to visualise the results

# Try plotting a map
library("maps")
library("ggplot2")

# create a vector for colours
clust <- usK$cluster
names(clust) <- toupper(us_data[, 1])

# load US state data
states <- map_data("state")

clust_assign <- apply(states, 1, function(x) {
    return(clust[toupper(x[5])])
})


# plot the states plot all states with ggplot
p <- ggplot()
p <- p + geom_polygon(data = states, aes(x = long, y = lat, group = group), 
    colour = "white", fill = gray(1/clust_assign))
p

plot of chunk unnamed-chunk-15

References

Ordination analyses

Constrained ordination analyses

References