Consistent Cluster Order with Kmeans in R

This might not be possible, but Google has failed me so far so I'm hoping someone else might have some insight. Sorry if this has been asked before.
The background: I have a database of information on different cities (name, population, pollution, crime, etc.) by year. I'm querying it to aggregate the data on a per-city basis and outputting the result to a table. That works fine.
The next step is running the kmeans() function in R on the data set to find clusters. In testing, I've found that 5 clusters is almost always a good choice via the "elbow method".
The issue I'm having is that these clusters have distinct meanings/interpretations, so I want to tag each row in the original data set with the cluster's interpretation for that row, not the cluster number. I don't want to identify row 2 with "cluster 5"; I want to say "low population, high crime, low income".
If R output the clusters in the same order, say with cluster 5 always being the cluster of cities with "low population, high crime, low income", that would work fine, but it doesn't. For instance, if you run code like this:
> a = kmeans(city_date,centers=5)
> b = kmeans(city_date,centers=5)
> c = kmeans(city_date,centers=5)
Then run this code:
a$centers
b$centers
c$centers
The clusters will all contain the same data, but the cluster numbers will differ. So a SQL mapping table from cluster number to interpretation won't work, because on one run the "low population, high crime, low income" cluster might be number 5, on the next it might be 2, then 4, and so on.
What I'm trying to figure out is whether there is a way to keep the output consistent. The data set gets updated, so it won't even be the same every time, and since R doesn't keep the cluster order consistent even on the same data set, I wonder whether this is possible at all.
Thanks for any help anyone can provide. My current idea is to output the $centers data to a SQL table, order the table by each metric so that the cluster with the highest/lowest value of each gets tagged as such, and then concatenate the tags to build the interpretation. This may work, but it isn't very elegant.
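That tagging idea can also be sketched entirely in R. Below is a minimal sketch, assuming a hypothetical data frame city_data with numeric columns population, crime, and income; the low/high split at the median is purely illustrative:
km <- kmeans(city_data, centers = 5)  # city_data and its columns are hypothetical
cen <- as.data.frame(km$centers)
# tag each metric "low" or "high" relative to the median center value
tag <- function(v, name) paste(ifelse(v < median(v), "low", "high"), name)
labels <- paste(tag(cen$population, "population"),
                tag(cen$crime, "crime"),
                tag(cen$income, "income"), sep = ", ")
# labels[km$cluster] now gives an interpretation string for every row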

I know this is a very old post, but I only came across it now. I had the same problem today and adapted the suggestion by Barker to come up with a solution:
library(dplyr)
# create a random data frame
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
# use kmeans a first time to get the centers
centers <- kmeans(df$obs, centers = 3)$centers
# order the centers
centers <- sort(centers)
# call kmeans again but this time passing the centers calculated in the previous step
clusteridx <- kmeans(df$obs, centers = centers)$cluster
Not very elegant, but it works. The clusteridx vector will always number the clusters according to the centers in ascending order.
This can also be collapsed into just one line if you prefer:
clusteridx <- kmeans(df$obs, centers = sort(kmeans(df$obs, centers = 3)$centers))$cluster

Usually k-means is initialized randomly a few times to avoid local minima. If you want the resulting clusters ordered, you have to order them manually after the k-means algorithm finishes.
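A minimal sketch of that manual reordering (x is a hypothetical data set here; ordering is by the first center column), so that cluster 1 always ends up with the smallest center:
km <- kmeans(x, centers = 5)
# the new label of old cluster j is the position of j in the ascending center order
relabel <- match(seq_len(nrow(km$centers)), order(km$centers[, 1]))
km$cluster <- relabel[km$cluster]  # cluster 1 now has the smallest center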

I haven't done this myself so I am not sure it will work, but kmeans has the parameter:
centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
If you know roughly where the clusters should be (perhaps by getting the cluster centers from a data set you are matching to), you could use that to initialize the model. That would make the starting locations non-random, so the clusters should stay in the same order. As an added benefit, initializing the cluster centers close to where they will end up should also speed up your clustering.
Edit
I just checked using the data from the kmeans example, but initializing with the first center at (1,1) and the second at (0,0) (the means of the distributions used to make the clusters), as below.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, matrix(c(1,0,1,0),ncol=2)))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
After repeated runs, I found that the first cluster was always in the top right and the second in the bottom left, whereas initializing with just the number 2 caused them to switch back and forth. If you have some approximate starting values for your clusters (i.e. a quantification of "low population, high crime, low income"), that could be your initialization and give you the results you want.
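For instance, a sketch of that initialization on the city data (the data frame city_data, its column order, and the guessed center values are all hypothetical):
# rows are rough guesses for the 5 cluster centers; columns must match city_data
# (in practice you would scale the data and the guesses consistently)
init <- rbind(c(1e4, 90, 2e4),   # low population, high crime, low income
              c(1e6, 10, 8e4),
              c(5e5, 50, 5e4),
              c(1e5, 20, 3e4),
              c(2e6, 70, 6e4))
cl <- kmeans(city_data, centers = init)
# cluster 1 now always corresponds to the first row of init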

This function runs kmeans on one-dimensional input and returns a normal "kmeans" object with sensibly numbered clusters, without having to run kmeans twice.
ordered_kmeans = function(x, centers, iter.max = 10, nstart = 1,
                          algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                                        "MacQueen"),
                          trace = FALSE,
                          desc = TRUE) {
  if (NCOL(x) > 1) {
    stop("only one-dimensional inputs are allowed")
  }
  k = kmeans(x = x, centers = centers, iter.max = iter.max, nstart = nstart,
             algorithm = algorithm, trace = trace)
  # old cluster indices, sorted by center value
  centers_ind = order(k$centers, decreasing = desc)
  # lookup table: old cluster number -> new ordered position
  centers_ord = setNames(seq_along(k$centers), nm = centers_ind)
  k$cluster = unname(centers_ord[as.character(k$cluster)])
  k$centers = matrix(k$centers[centers_ind], ncol = 1)
  k$withinss = k$withinss[centers_ind]
  k$size = k$size[centers_ind]
  k
}
Example usage:
vec = c(20.28, 9.49, 7.14, 2.48, 2.36, 1.82, 1.3, 1.26, 1.11, 0.98,
        0.81, 0.73, 0.66, 0.63, 0.57, 0.53, 0.44, 0.42, 0.38, 0.37, 0.33,
        0.29, 0.28, 0.27, 0.26, 0.23, 0.23, 0.2, 0.18, 0.16, 0.15, 0.14,
        0.14, 0.12, 0.11, 0.1, 0.1, 0.08)
# For comparison
set.seed(1)
k = kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3, desc = FALSE); k

Here's an example where you ascribe letter factor groups to the k-means clusters, ordered from A (low) to C (high). The parameters can be altered to fit the data you have.
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
km <- kmeans(df$obs, centers = 3)
# cluster numbers, ordered by ascending center value
km.order <- as.numeric(names(sort(km$centers[, 1])))
# label them A (lowest center) through C (highest center)
names(km.order) <- LETTERS[1:3]
km.order <- sort(km.order)
clus.order <- factor(names(km.order[km$cluster]))
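A short usage note (a sketch): the letter labels can be attached straight back onto the data frame:
df$group <- clus.order  # one ordered letter label per row
head(df)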

Related

Simulating a Markov Chain in R and Sequence Search

So I am working on simulating a Markov chain in R in which the states are sunny (S), cloudy (C), and rainy (R), and I am looking to figure out the probability that a sunny day is followed by two consecutive cloudy days.
Here is what I have so far:
P = matrix(c(0.7, 0.3, 0.2, 0.2, 0.5, 0.6, 0.1, 0.2, 0.2), 3)
print(P)
x = c("S", "C", "R")
n = 10000
states = character(n+100)
states[1] = "C"
for (i in 2:(n + 100)) {
  if (states[i - 1] == "S") {cond.prob = P[1, ]}
  else if (states[i - 1] == "C") {cond.prob = P[2, ]}
  else {cond.prob = P[3, ]}
  states[i] = sample(x, 1, prob = cond.prob)
}
print(states[1:100])
states = states[-(1:100)]
head(states)
tail(states)
states[1:200]
At the end of this I am left with a sequence of states. I am looking to divide this sequence into groups of three states (for the three days in the chain) and then count how many of those three-state groups equal SCC.
I am drawing a blank on how I would go about doing this, and any help would be greatly appreciated!
Assuming you want a sliding window (i.e., SCC could occur at positions 1-3, or 2-4, etc.), collapsing the states to a string and doing a regex search should do the trick:
collapsed <- paste(states, collapse="")
# gregexpr returns -1 when there is no match, so count only positive positions
sum(gregexpr("SCC", collapsed)[[1]] > 0)
On the other hand, if you DON'T want a sliding window (i.e., SCC must be in position 1-3, or 4-6, or 7-9, etc.) then you can chop up the sequence using tapply:
# label the days 1,1,1,2,2,2,... so each day belongs to a block of three
indexer <- rep(1:ceiling(length(states)/3), each=3, length.out=length(states))
chopped <- tapply(states, indexer, paste0, collapse="")
sum(chopped == "SCC")
Eric provided the correct answer, but just for the sake of completeness: you can get the probability you are looking for from the equilibrium distribution of the chain:
# the equilibrium distribution: left eigenvector of P with eigenvalue 1
e <- eigen(t(P))$vectors[,1]
e.norm <- e/sum(e)
# probability of S followed by C, C
e.norm[1] * P[1,2] * P[2,2]
This is less computationally expensive and will give you a more accurate answer than the simulation, whose estimate is also slightly biased towards the initial state of the chain.
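As a quick sanity check (a sketch, assuming the states vector from the question is still in scope), the analytic value should be close to the sliding-window estimate:
collapsed <- paste(states, collapse = "")
n.scc <- sum(gregexpr("SCC", collapsed)[[1]] > 0)
n.scc / (length(states) - 2)  # empirical frequency of SCC windows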

Accessing index of the top num.out outliers in baysout function from the 'dprep' R package

I am using the baysout function for outlier detection from the 'dprep' package in R. The returned value is supposed to be a 2-column matrix according to the R documentation. The first column contains the indexes of the top num.out (a user-defined number of outliers to return) and the second, the outlyingness measure for each index.
The problem is that I want to access the index number separately, but I am not able to do this. The function is actually returning a num.out x 1 matrix as opposed to a num.out x 2 matrix. The index value and the outlyingness measure are there, but I cannot access them separately. Please see the sample code below:
# Install and load the dprep library
install.packages("dprep")
library(dprep)
# Create 5x3 matrix for input to baysout function
A = matrix(c(0.8, 0.4, 1.2, 0.4, 1.2, 1.1, 0.3,
             0.1, 1.9, 1.1, 0.9, 1.4, 0.3, 1.5, 0.5), nrow=5, ncol=3)
# Run the baysout function on matrix A and store result in outliers
outliers <- baysout(A, blocks = 3, nclass=0, k = 3, num.out = 3)
# print out result
print(outliers)
# attempt to access the index
print(outliers[1,1])
Output is as follows:
# print out result
> print(outliers)
      [,1]
4 3.625798
3 2.901654
2 2.850419
# attempt to access the index
> print(outliers[1,1])
       4
3.625798
This is not the real data I am using, which is much larger, and I would like to gain access to the index. In the example above I would like to be able to access the number 4 on its own, but it is coupled with the 3.625798 and I am not able to access each figure separately. Would anyone have any advice on how I could do this?
Solution by ekstroem:
Use:
index <- as.numeric(rownames(outliers))
The documentation may not be entirely correct. In any case the index is stored in the row names.
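In other words (a short usage sketch), the indices live in the row names and the scores in the matrix values:
idx <- as.numeric(rownames(outliers))  # e.g. 4 3 2
score <- outliers[, 1]                 # the outlyingness measures
idx[1]                                 # the number 4 on its own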

Spatial Autocorrelation Analysis (Global Moran's I) in R

I have a list of points I want to check for autocorrelation using Moran's I, dividing the area of interest into 4 x 4 quadrats.
Now, every example I found on Google (e.g. http://www.ats.ucla.edu/stat/r/faq/morans_i.htm) uses some kind of measured value as the first input to the Moran's I function, no matter which library is used (I looked into the ape and spdep packages).
However, all I have are the points themselves that I want to check the correlation for.
The problem is, as funny (or sad) as this might sound, I have no idea what I'm doing here. I'm not much of a (spatial) statistics guy; all I want to find out is whether a collection of points is dispersed, clustered, or random using Moran's I.
Is my approach correct? If not, where and what am I doing wrong?
Thanks
This is what I have so far:
# download, install and load the spatstat package (http://www.spatstat.org/)
install.packages("spatstat")
library(spatstat)
# Download, install and run the ape package (http://cran.r-project.org/web/packages/ape/)
install.packages("ape")
library(ape)
# Define points
x <- c(3.4, 7.3, 6.3, 7.7, 5.2, 0.3, 6.8, 7.5, 5.4, 6.1, 5.9, 3.1, 5.2, 1.4, 5.6, 0.3)
y <- c(2.2, 0.4, 0.8, 6.6, 5.6, 2.5, 7.6, 0.3, 3.5, 3.1, 6.1, 6.4, 1.5, 3.9, 3.6, 5.2)
# Store the coordinates as a matrix
coords <- as.matrix(cbind(x, y))
# Store the points as two-dimensional point pattern (ppp) object (ranging from 0 to 8 on both axis)
coords.ppp <- as.ppp(coords, c(0, 8, 0, 8))
# Quadrat count
coords.quadrat <- quadratcount(coords.ppp, 4)
# Store the Quadrat counts as vector
coords.quadrat.vector <- as.vector(coords.quadrat)
# Replace any value > 1 with 1
coords.quadrat.binary <- ifelse(coords.quadrat.vector > 1, 1, coords.quadrat.vector)
# Moran's I
# Generate the distance matrix (euclidean distances between points)
coords.dists <- as.matrix(dist(coords))
# Take the inverse of the matrix
coords.dists.inv <- 1/coords.dists
# replace the diagonal entries (Inf) with zeroes
diag(coords.dists.inv) <- 0
writeLines("Moran's I:")
print(Moran.I(coords.quadrat.binary, coords.dists.inv))
writeLines("")
There are a few ways of doing this. I took a great (free) course on analysing spatial data with R by Roger Bivand, who is very active on the r-sig-geo mailing list (where you may want to direct this query). You basically want to assess whether or not your point pattern is completely spatially random.
You can plot the empirical cumulative distribution of nearest neighbour distances of your observed points, and then compare this to the ecdf of randomly generated sets of completely spatially random point patterns within your observation window:
# The data
coords.ppp <- ppp( x , y , xrange = c(0, 8) , yrange = c(0, 8) )
# Number of points
n <- coords.ppp$n
# We want to generate completely spatially random point patterns to compare against the observed
ex <- expression( runifpoint( n , win = owin(c(0,8),c(0,8))))
# Reproducible simulation
set.seed(1)
# Compute a simulation envelope using Gest, which estimates the nearest neighbour distance distribution function G(r)
res <- envelope( coords.ppp , Gest , nsim = 99, simulate = ex ,verbose = FALSE, savefuns = TRUE )
# Plot
plot(res)
The observed nearest neighbour distribution is completely contained within the grey envelope of the ecdf of randomly generated point patterns. My conclusion would be that you have a completely spatially random point pattern, with the caveat that you don't have many points.
As an aside, where the black observed line falls below the grey envelope, we may infer that points are further apart than would be expected by chance, and vice versa above the envelope.
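As a complement to the envelope approach, and closer to the 4 x 4 quadrat idea in the question, spatstat also provides a chi-squared quadrat test of complete spatial randomness (a sketch, reusing coords.ppp from above):
# chi-squared test of CSR based on counts in a 4 x 4 grid of quadrats
# (with this few points the chi-squared approximation is rough; spatstat will warn)
quadrat.test(coords.ppp, nx = 4, ny = 4)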

interactively work with xy point plot clusters - group manipulation in r

I have a large number of pairs of X and Y variables along with their cluster membership column. The cluster membership (group) may not always be right (a limitation of the clustering algorithm), so I want to interactively visualize the clusters and manipulate the cluster memberships of identified points.
I tried rggobi, and the following is the point I was able to get to (I do not mean that I need to use rggobi/ggobi; if better options are available, you are welcome to suggest them).
# data
set.seed(1234)
c1 <- rnorm(40, 0.1, 0.02); c2 <- rnorm(40, 0.3, 0.01)
c3 <- rnorm(40, 0.5, 0.01); c4 <- rnorm(40, 0.7, 0.01)
c5 <- rnorm(40, 0.9, 0.03)
Yv <- 0.3 + rnorm(200, 0.05, 0.05)
myd <- data.frame(Xv = round(c(c1, c2, c3, c4, c5), 2), Yv = round(Yv, 2),
                  cltr = factor(rep(1:5, each = 40)))
require(rggobi)
g <- ggobi(myd)
display(g[1], vars=list(X="Xv", Y="Yv"))
You can see five clusters, colored differently by the cltr variable. I manually identified the points that are outliers, and I want to set their value to NA in the cltr variable. Is there any easy way to disassociate such memberships and write the result to a file?
You could try identify() to get the indices of the points manually:
## use base::plot
plot(myd$Xv, myd$Yv, col=myd$cltr)
exclude <- identify(myd$Xv, myd$Yv) ## left click on the points you want to exclude (right click to stop/finish)
myd$cltr[exclude] <- NA
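And to cover the "write to file" part of the question (a sketch; the file name is your choice):
# save the data with the updated cluster memberships
write.csv(myd, "myd_clusters.csv", row.names = FALSE)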

Compare two user defined curves and score their similarity

I have a set of 2 curves (each with a few hundred to a couple thousand data points) that I want to compare to get some similarity "score". Actually, I have >100 of those sets to compare... I am familiar with R (or at least Bioconductor) and would like to use it.
I tried the ccf() function but I'm not too happy about it.
For example, if I compare c1 to the following curves:
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1b <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5) # perfect match! ideally score of 1
c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5) # total opposite, ideally score of -1? (what would 0 be though?)
c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9) #pretty good, score of ???
Note that the vectors don't have the same length, so they need to be normalized somehow... Any idea?
If you look at those 2 lines, they are fairly similar, and I think that as a first step, measuring the area under the 2 curves and subtracting would do. I looked at the post "Shaded area under 2 curves in R", but that is not quite what I need.
A second (optional) issue is that for lines that have the same profile but different amplitudes, I would like to score them as very similar, even though the area difference between them would be big:
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) # very good, score of ??
I hope a biologist pretending to formulate a problem for programmers is OK...
I'd be happy to provide some real life examples if needed.
Thanks in advance!
They don't form curves in the usual meaning of paired x-y values unless they are of equal length. The first three are of equal length, and after packaging them in a matrix, the rcorr function in the Hmisc package returns:
> rcorr(as.matrix(dfrm))[[1]]
     c1 c1b c1c
c1    1   1  -1
c1b   1   1  -1
c1c  -1  -1   1   # as desired if you scaled them to 0-1
The correlation of the c1 and c4 vectors:
> cor( c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5),
c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) )
[1] 0.9874975
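When the vectors have different lengths (as with c1 and c2 in the question), one option is to resample both onto a common grid with approx() before correlating; a sketch, reusing c1 and c2 from the question and an arbitrary grid size:
n <- 100  # common grid size (arbitrary)
c1r <- approx(seq_along(c1), c1, xout = seq(1, length(c1), length.out = n))$y
c2r <- approx(seq_along(c2), c2, xout = seq(1, length(c2), length.out = n))$y
cor(c1r, c2r)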
I do not have a very good answer, but I have faced similar questions in the past, probably on more than one occasion. My approach is to ask myself what makes my curves similar when I subjectively evaluate them (the scientific term here is "eye-balling" :). Is it the area under the curve? Do I count linear translation, rotation, or scaling (zoom) of my curves as contributing to dissimilarity? If not, I take out all the factors I do not care about with a suitable normalization (e.g. scale the curves to cover the same ranges in x and y).
I am confident that there is a rigorous mathematical theory for this topic; I would search for the words "affinity" and "affine". That said, my primitive/naive methods have usually sufficed for the work I was doing.
You may want to ask this question on some math forum.
If the proteins you compare are reasonably close orthologs, you should be able to obtain alignments, either for each pair you want to score or a multiple alignment for the entire bunch. Depending on the application, I think the latter will be more rigorous. I would then extract the folding score of only those amino acids that are aligned, so that all profiles have the same length, and calculate correlation measures or squared normalized dot products of the profiles as a similarity measure. The squared normalized dot product or the Spearman rank correlation will be less sensitive to amplitude differences, which you seem to want. This makes sure you are comparing elements that are reasonably paired (to the extent the alignment is reasonable), and lets you answer questions like: "Are corresponding residues in the compared proteins generally folded to a similar extent?"
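For the two measures mentioned above, a self-contained sketch on the equal-length example vectors from the question:
# squared normalized dot product (squared cosine similarity): insensitive to amplitude scaling
sq_norm_dot <- function(a, b) (sum(a * b) / sqrt(sum(a^2) * sum(b^2)))^2
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3)
sq_norm_dot(c1, c4)
cor(c1, c4, method = "spearman")  # rank correlation, also amplitude-insensitive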
