How to find the right clustering algorithm? (R)

I would like to find an algorithm that circumvents some drawbacks of k-means.
Given:
x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
matrix <- cbind(x, y)                    # data matrix (note: this shadows base::matrix)
Kmeans <- kmeans(matrix, centers = 2)    # k-means with 2 centroids
plot(x, y, col = Kmeans$cluster, pch = 19, cex = 2)
points(Kmeans$centers, col = 1:2, pch = 3, cex = 3, lwd = 3)
Here I would like an algorithm that clusters the data into two groups separated by a diagonal running from the bottom-left corner to the top-right corner.

Try Mclust from the mclust package; it will try to fit a Gaussian mixture to your data.
The default behavior:
library(mclust)
mc <- Mclust(matrix)
points(t(mc$parameters$mean))
plot(mc)
... will find 4 groups, but you might be able to force it to use 2 components, or to force the correlation structure of the Gaussians to be stretched in the right direction.
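For example (a sketch of my own, not from the original answer): G fixes the number of mixture components, and modelNames restricts the covariance structure; "VVV" allows ellipsoidal, freely oriented components.
mc2 <- Mclust(matrix, G = 2, modelNames = "VVV")   # force exactly 2 components
plot(mc2, what = "classification")                 # show the resulting partition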
Be aware that it will be hard to interpret and justify the meaning of your groups unless you understand very well why you want exactly two of them.

What you are asking for can be solved in multiple ways. Here are two:
The first way is to simply define the separating line of your clusters. Since you know how your points should be grouped (by a line), you can use that directly.
If you want the line to start at the origin, simply check whether x > y:
x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
thePoints <- cbind(x,y)
as.integer(thePoints[,1] > thePoints[,2])
[1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
This puts the points on one side of the diagonal through the origin in one group and the rest in the other. Keep in mind that if your line does not go through the origin, you have to modify this check a bit.
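For example (the intercept a and slope b below are made-up illustration values), you can classify by which side of the line y = a + b*x each point falls on:
a <- 1   # hypothetical intercept
b <- 1   # hypothetical slope
as.integer(y < a + b * x)   # 1 = below the line, 0 = above it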
Second way: K-means with correlation distance.
The K-means function:
myKmeans <- function(x, centers, distFun, nItter=10) {
    clusterHistory <- vector(nItter, mode="list")
    centerHistory  <- vector(nItter, mode="list")
    for(i in 1:nItter) {
        distsToCenters <- distFun(x, centers)
        clusters <- apply(distsToCenters, 1, which.min)
        centers  <- apply(x, 2, tapply, clusters, mean)
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]]  <- centers
    }
    list(clusters=clusterHistory, centers=centerHistory)
}
And correlation distance:
myCor <- function(points1, points2) {
    1 - ((cor(t(points1), t(points2)) + 1) / 2)
}
theResult <- myKmeans(mat, centers, myCor, 10)
Here is how both solutions would look:
plot(thePoints, col = as.integer(thePoints[,1] > thePoints[,2]) + 1, main = "Using a line", xlab = "x", ylab = "y")
plot(thePoints, col = theResult$clusters[[10]], main = "K-means with correlation distance", xlab = "x", ylab = "y")
points(theResult$centers[[10]], col = 1:2, cex = 3, pch = 19)
So it's more about what kind of distance measure you are using and not about some kind of deficiency of K-means.
You can also find better implementations of K-means with correlation distance for R instead of using the one I provided here.
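One such option (an assumption on my part, not something from the original answer; check the amap documentation for the exact method names) is Kmeans() from the amap package, which supports correlation-type distances directly:
library(amap)
amapRes <- Kmeans(thePoints, centers = 2, method = "correlation")
plot(thePoints, col = amapRes$cluster, pch = 19)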

Related

Shrink convex hull

I have a bunch of points in 2D space and have calculated a convex hull for them. I would now like to "tighten" the hull so that it no longer necessarily encompasses all points. In the typical nails-in-board-with-rubber-band analogy, what I'd like to achieve is to be able to tune the elasticity of the rubber band and allow nails to bend under pressure above some limit. That's just an analogy; there is no real physics here. This would be loosely related to the reduction in hull area if a given point were removed, but not quite, because there could be two points that are very close to each other. This is not necessarily related to outlier detection, because you could imagine a pattern where a large fraction of the nails would bend if they lie on a narrow line (imagine a hammer shape, for example). All of this has to be reasonably fast for thousands of points. Any hints where I should look in terms of algorithms? An implementation in R would be perfect, but is not required.
EDIT AFTER COMMENT: The three points I've labelled are those with the largest potential for reducing the hull area if they are excluded. In the plot there is no other set of three points that would result in a larger area reduction. A naive implementation of what I'm looking for would be to randomly sample some fraction of the points, calculate the hull area, remove each point on the hull iteratively, recalculate the area, repeat many times, and remove the points that tend to lead to a large area reduction. Maybe this could be implemented in some random-forest variant? It's not quite right, though, because I would like the points to be removed one by one so that you get the following result. If you looked at all points in one go, it would probably be best to trim from the edges of the "hammer head".
Suppose I have a set of points like this:
set.seed(69)
x <- runif(20)
y <- runif(20)
plot(x, y)
Then it is easy to find the subset points that sit on the convex hull by doing:
ss <- chull(x, y)
This means we can plot the convex hull by doing:
lines(x[c(ss, ss[1])], y[c(ss, ss[1])], col = "red")
Now we can randomly remove one of the points that sits on the convex hull (i.e. "bend a nail") by doing:
bend <- sample(ss, 1)   # pick a random hull vertex
x <- x[-bend]
y <- y[-bend]
And we can then repeat the process of finding the convex hull of this new set of points:
ss <- chull(x, y)
lines(x[c(ss, ss[1])], y[c(ss, ss[1])], col = "blue", lty = 2)
To get the point which will, on removal, cause the greatest reduction in area, one option would be the following function:
library(sp)
shrink <- function(coords)
{
    ss <- chull(coords[, 1], coords[, 2])
    outlier <- ss[which.min(sapply(seq_along(ss),
                    function(i) Polygon(coords[ss[-i], ], hole = FALSE)@area))]
    coords[-outlier, ]
}
So you could do something like:
coords <- cbind(x, y)
new_coords <- shrink(coords)
new_chull <- new_coords[chull(new_coords[, 1], new_coords[, 2]),]
new_chull <- rbind(new_chull, new_chull[1,])
plot(x, y)
lines(new_chull[,1], new_chull[, 2], col = "red")
Of course, you could do this in a loop so that new_coords is fed back into shrink multiple times.
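For example, a minimal sketch of that loop (my addition, not part of the original answer), peeling off three points one at a time:
coords <- cbind(x, y)
for (i in 1:3) {
    coords <- shrink(coords)   # each call drops the point whose removal shrinks the hull the most
}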
Calculate a robust center and covariance using cov.mcd in MASS and then the Mahalanobis distance of each point from that center (using mahalanobis). We then show a quantile plot of the Mahalanobis distances using PlotMD from modi and also mark the associated outliers in red in the second plot. (There are other functions in modi that may be of interest as well.)
library(MASS)
library(modi)
library(psych)
set.seed(69)
x <- runif(20)
y <- runif(20)
m <- cbind(x, y)
mcd <- cov.mcd(m)
md <- mahalanobis(m, mcd$center, mcd$cov)
stats <- PlotMD(md, 2, alpha = 0.90)
This gives a quantile plot of the Mahalanobis distances (screenshot not reproduced here). We then show the convex hull using lines and the outliers in red:
plot(m)
ix <- chull(m)
lines(m[c(ix, ix[1]), ])
wx <- which(md > stats$halpha)
points(m[wx, ], col = "red", pch = 20)
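If you then want to drop the flagged points and redraw the tightened hull, a small sketch (my addition, not part of the original answer):
m2 <- if (length(wx)) m[-wx, ] else m   # guard against the case where no outliers were flagged
ix2 <- chull(m2)
lines(m2[c(ix2, ix2[1]), ], col = "blue", lty = 2)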
Thank you both! I've tried various methods for outlier detection, but they are not quite what I was looking for: they worked badly due to the weird shapes of my clusters. I know I talked about convex hull area, but I think filtering on segment lengths yields better results and is closer to what I really wanted. It would then look something like this:
shrink <- function(xy, max_length = 30){
    to_keep <- 1:(dim(xy)[1])
    centroid <- c(mean(xy[, 1]), mean(xy[, 2]))
    while (TRUE){
        ss <- chull(xy[, 1], xy[, 2])
        ss <- c(ss, ss[1])
        lengths <- sapply(1:(length(ss) - 1),
                          function(i) sum((xy[ss[i + 1], ] - xy[ss[i], ])^2))
        # This gets the point with the longest convex hull segment. chull returns
        # points in clockwise order, so the point to remove is either this one or
        # the one after it. Remove the one furthest from the centroid.
        max_point <- which.max(lengths)
        if (lengths[max_point] < max_length) return(to_keep)
        if (sum((xy[ss[max_point], ] - centroid)^2) > sum((xy[ss[max_point + 1], ] - centroid)^2)){
            xy <- xy[-ss[max_point], ]
            to_keep <- to_keep[-ss[max_point]]
        } else {
            xy <- xy[-ss[max_point + 1], ]
            to_keep <- to_keep[-ss[max_point + 1]]
        }
    }
}
It's not optimal because it factors in the distance to the centroid, which I would have liked to avoid, and there is a max_length parameter that should be calculated from the data instead of being hard-coded.
No filter (plot omitted): it looks like this because there are 500,000 cells in here, and many of them end up "wrong" when projecting from ~20,000 dimensions down to 2.
With the filter (plot omitted): note that it filters out points at the tips of some clusters. This is less than optimal, but acceptable. The overlap between some clusters is real and should be there.

How to find out the x/y shift between two raster layers?

Regarding two raster layers which do not match exactly because of defective data, I would like to know how to find out the x/y shift between these two layers so that I can align them properly using raster::shift().
I have already tried to investigate the x/y shift using QGIS, but I only found the georeferencing tool, which lets me relocate raster layers but is not interactive in the way I need. I am looking for a way to move my defective raster over a basemap and get information about the x/y shift.
I am NOT looking for a solution where I have to set specific georeferencing points to align the two raster layers, since I am working on a highly dynamic landscape where it is difficult to find matching points, but where it is possible to align the raster layers by the textural information provided by the datasets.
A code example should look like the solution provided by user dTanMan (https://gis.stackexchange.com/users/77712/dtanman) in this post: https://gis.stackexchange.com/a/201750
raster <- raster()
raster <- shift(raster, x=5, y=-15)
Thanks a lot in advance, cheers, ExploreR
Perhaps you can use something like this
Example data
library(raster)
a <- raster(ncol=20, nrow=20, xmn=0,xmx=20,ymn=0,ymx=20)
values(a) <- 1:400
set.seed(3)
b <- a + runif(400)
Function to compare similarity of cell values
rmse <- function(obs, prd) {
    sqrt(mean((obs - prd)^2, na.rm = TRUE))
}
Values from reference raster. May need to take a sample if raster is very large
nsamples <- 10000
s <- sampleRegular(a, nsamples, cells=TRUE)
sample_a <- s[,2]
Locations to be compared
xy <- xyFromCell(a, s[,1])
Test range for cell shifts
xrange <- -5:5 * xres(a)
yrange <- -5:5 * yres(a)
Matrix to store the results in
result <- cbind(rep(xrange, each=length(yrange)), rep(yrange, length(xrange)), NA)
colnames(result) <- c("dx", "dy", "rmse")
Loop over cellshift combinations
i <- 1
for (dx in xrange) {
    for (dy in yrange) {
        x <- shift(b, dx, dy)
        sample_b <- extract(x, xy)
        result[i, 3] <- rmse(sample_a, sample_b)
        i <- i + 1
    }
}
Results suggest that dx=0 and dy=0 is the best in this case.
r <- result[order(result[,3]), ]
head(r)
# dx dy rmse
#[1,] 0 0 0.5734866
#[2,] 1 0 0.5800670
#[3,] -1 0 1.5252878
#[4,] 2 0 1.5302921
#[5,] -2 0 2.5153573
#[6,] 3 0 2.5157728
Test
bb <- shift(b, dx=r[1,1], dy=r[1,2])
rmse(values(a), values(bb))
#[1] 0.5734866
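Transferred to the question's own setting, the idea would be the same: treat the basemap as a and the defective layer as b, run the grid search above, and hand the winning offsets to raster::shift(). The file names below are placeholders, not real data:
# a <- raster("basemap.tif")          # hypothetical reference layer
# b <- raster("defective_layer.tif")  # hypothetical layer to be aligned
best <- result[which.min(result[, 3]), ]
b_aligned <- shift(b, dx = best[["dx"]], dy = best[["dy"]])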

Having extra values in a loop output in R

I have a graph like below:
adj <- read.table(text = "
A B C D
A 0 1 0 0
B 1 0 1 1
C 0 1 0 0
D 0 1 0 0
", header = T)
library(igraph)
g <- graph_from_adjacency_matrix(as.matrix(adj))
I want to compute each node's distance like below:
distMat <- 1/2^distances(g)
for (i in 1:nrow(distMat)) {
res[i] <- sum(distMat[i, ]) - distMat[i, i]
}
names(res) <- V(g)$name[V(g)]
res
But the number of values in the result should equal the number of nodes in the graph, and it doesn't: I get 5 values instead of 4. Any idea how to fix it?
Your code does not return 5 values. One reason you may have run into trouble is that your code is much more complicated than it needs to be. Look at what it does:
res <- 1:nrow(distMat)
for (i in 1:nrow(distMat)) {
    res[i] <- sum(distMat[i, ]) - distMat[i, i]
}
This is a loop over your distance matrix where each row is summed before subtracting distMat[i,i], which is always 1 because distMat[i,i] refers to each node's distance to itself (1/2^0 = 1). A nicer rewrite of the same would be:
res <- rowSums(1/2^distances(g))-1
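A quick sanity check (my addition) that the one-liner agrees with the original loop on this graph:
distMat <- 1/2^distances(g)
res_loop <- sapply(1:nrow(distMat), function(i) sum(distMat[i, ]) - distMat[i, i])
all.equal(unname(res_loop), unname(rowSums(distMat) - 1))   # should be TRUE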
It then becomes easier to see that this kind of calculation is really a centrality measure: the lowest sum of distances to all other nodes is associated with the highest centrality.
igraph has functions to compute a whole range of documented and established centrality measures; see degree(), closeness(), or betweenness(). What is the advantage of yours?
Look at your centrality measure and play around using this code:
# Make random graph with more nodes and calculate your centrality measure as res
graph <- erdos.renyi.game(80, 100, "gnm", directed=FALSE)
res <- rowSums(1/2^distances(graph))-1
# Colour each node according to their distance to every other node
colfunc <- colorRampPalette(c("yellow", "red"))
gradient <- colfunc(max(round(res), na.rm = TRUE) + 1)  # enough colours for the largest rounded value
V(graph)$color <- gradient[round(res) + 1]
# Plot it
plot(graph, vertex.label="")
Play around by using res <- betweenness(graph) for example.

colorRampPalette in R with more than 2 clusters

I want to color my points with different colors in a fuzzy way, according to a probability function associated with the points.
I've managed it for the 2-cluster case. First I build my dataset and the associated probabilities given 2 clusters.
library(MASS)   # for mvrnorm
set.seed(16)
rbPal <- colorRampPalette(c('yellow','red'))
(mu1 <- c(0,0))                       # mean vector, multivariate normal 1
(S1  <- matrix(c(0.1,0,0,0.6),2))     # var/cov matrix, multivariate normal 1
(mu2 <- c(3,0))                       # mean vector, multivariate normal 2
(S2  <- matrix(c(1,0,0,0.1),2))       # var/cov matrix, multivariate normal 2
x1<-mvrnorm(n=100, mu=mu1,Sigma=S1 )
x2<-mvrnorm(n=100, mu=mu2,Sigma=S2 )
x<-rbind(x1,x2) # Dataset
euc.dist <- function(a, b){
    sqrt(sum((a - b)^2))
}
randC<-x[sample(nrow(x),2),]
Distmatrix <- t(apply(x, 1, function(r) apply(randC, 1, function(s) euc.dist(r, s))))
mat<-matrix(,200,2)
mat<-apply(mat,2,function(x) x=apply(Distmatrix,1, prod))/Distmatrix
P<-t(apply(mat, 1, function(x) x/sum(x)))
D4<-data.frame(x,P)
D4$Col <- rbPal(10)[as.numeric(cut(D4$X1.1,breaks = 10))]
plot(D4$X1,D4$X2,pch = 20,col = D4$Col, cex=1.2)
points(randC, col="red")
That's what I get when treating the 2 points as centroids of the clusters.
What if I wanted to do the same coloring job with more than 2 clusters?
So I should have:
[...]
set.seed(50)
rbPal <- colorRampPalette(c('yellow','red',"green"))
mat<-matrix(,200,3)
randC<-x[sample(nrow(x),3),]
Distmatrix <- t(apply(x, 1, function(r) apply(randC, 1, function(s) euc.dist(r, s))))
mat<-apply(mat,2,function(x) x=apply(Distmatrix,1, prod))/Distmatrix
P<-t(apply(mat, 1, function(x) x/sum(x)))
D4<-data.frame(x,P)
D4$Col <- rbPal(10)[as.numeric(cut(D4$X1.1,breaks = 10))]
plot(D4$X1,D4$X2,pch = 20,col = D4$Col, cex=1.2)
points(randC, col="red")
That's wrong, because I want each centroid to get the maximum value of its own color, with the shading then following the distance to the corresponding cluster.
You may need to do the mixing function yourself.
If you have more than two clusters, a linear color space is not enough anymore.
The easiest choice is a linear mixing in each color component; it is straightforward to implement. For more advanced cases, you may want "balanced" points (where all distances are equal) to be gray rather than the average color.
As an ad-hoc solution, you could also set up a palette for each cluster, running from gray to that cluster's color. Then use (x_j - x_i)/x_j as the value in the i-th palette, where x_i is the smallest distance and x_j the second smallest. If x_i = x_j, the value will be 0 (gray). If x_i = 0, the value will be 1.
This is probably quite pretty, but can be misleading because it doesn't use the same scaling everywhere.
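A minimal sketch of that ad-hoc idea (my addition), assuming the Distmatrix and D4 objects from the question's 3-centroid code; the base colors are arbitrary choices:
k <- ncol(Distmatrix)
baseCols <- c("yellow", "red", "green")
pals <- lapply(1:k, function(i) colorRampPalette(c("grey", baseCols[i]))(101))
cols <- apply(Distmatrix, 1, function(d) {
    o <- order(d)
    x_i <- d[o[1]]                                   # smallest distance
    x_j <- d[o[2]]                                   # second smallest distance
    val <- if (x_j == 0) 0 else (x_j - x_i) / x_j    # 0 = gray (tie), 1 = pure cluster color
    pals[[o[1]]][round(val * 100) + 1]
})
plot(D4$X1, D4$X2, pch = 20, col = cols, cex = 1.2)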
I think I found a good solution; here's the code:
set.seed(50)
k <- 3                                  # number of clusters/centroids
mat <- matrix(, 200, 3)
randC <- x[sample(nrow(x), 3), ]
Distmatrix <- t(apply(x, 1, function(r) apply(randC, 1, function(s) euc.dist(r, s))))
mat <- apply(mat, 2, function(x) x = apply(Distmatrix, 1, prod)) / Distmatrix
P <- t(apply(mat, 1, function(x) x / sum(x)))
D4 <- data.frame(x, P)
rbPal <- list()
for(i in 1:k){
    rbPal[[i]] <- colorRampPalette(c('white', col = I(i + 1)))
}
for(i in 1:k){
    D4[[dim(D4)[2] + 1]] <- rbPal[[i]](10)[as.numeric(cut(D4[[2 + i]], breaks = 10))]
}
for(i in 1:k){
    D4[[dim(D4)[2] + 1]] <- t(col2rgb(D4[[dim(D4)[2] - k + 1]]))
}
prova <- matrix(0, dim(D4)[1], 3)
for(i in 1:k){
    prova <- prova + D4[, (dim(D4)[2] - k + i)] * P[, i]
}
prova[is.nan(prova)] <- 0
provcol <- apply(prova, 1, function(x) rgb(x[1], x[2], x[3], maxColorValue = 255))
plot(D4$X1, D4$X2, pch = 20, col = provcol, cex = 1.5)
points(randC, col = "red")
I basically created k different color palettes, each starting from white, which is the color all the clusters have in common. Then, according to the probabilities, I mixed the RGB values of the k cluster colors with a weighted average.

How do I plot a cross-correlation matrix for time series?

I have a time series representation of my data as follows (without the row and column annotations):
L1 L2 L3 L4
t=1 0 1 1 0
t=2 0 1 1 1
t=3 1 0 1 1
t=4 0 1 1 0
I am reading this into R as:
timeseries = read.table("./test", header=F)
I am plotting timeseries for L1 using
ts.plot(timeseries$V1)
and plotting the cross-correlation function as:
ccf(timeseries$V1, timeseries$V2)
Now, can someone please tell me how to plot a cross-correlation matrix that shows the output of this function for L1-L4? Basically, I am after something like a 4x4 matrix of such plots.
There seems to be another trivial way of doing it!
timeseries = read.table("./test", header=F)
acf(timeseries)
gives me a matrix of correlation plots. Of course, there are other options that can be passed to acf if a covariance is needed.
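For instance, the covariance version is just a different type argument:
acf(timeseries, type = "covariance")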
A trivial way of doing this is to simply create a matrix of plots on your plotting device and place each ccf plot in one by one:
M <- matrix(sample(0:1, 40, replace = TRUE), nrow = 10)
par(mfrow = c(4, 4))
for (i in 1:4){
    for (j in 1:4){
        ccf(M[, i], M[, j])
    }
}
But if you wait around a bit, someone who knows the time series packages more intimately may swing by with a function that does this a bit more nicely.
Try this where M is as in joran's post:
pnl <- function(x, y = x) { par(new = TRUE); ccf(x, y) }
pairs(as.data.frame(M), upper.panel = pnl, diag.panel = pnl, cex.labels = 1)
