Select the most dissimilar individual using cluster analysis [duplicate]

I want to cluster my data into, say, 5 clusters and then select the 50 most dissimilar individuals from the whole data set, with each cluster contributing in proportion to its size. For example, if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and go from there …
That has become a problem since my data has become really big.
I will appreciate any help or suggestions on how to do the above in R.

I'm not sure if this is exactly what you are looking for, but maybe it helps:
mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification
# cluster objects and calculate dissimilarity matrix
cl <- cutree(hclust(
sim <- dist(mydata, diag = TRUE, upper=TRUE)), 5)
# combine results, take sum to aggregate dissimilarity
res <- data.frame(id=rownames(mydata),
cluster=cl, dis_sim=rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]
# split object
reslist <- split(res, f=res$cluster)
## take the three items with the highest overall dissimilarity in each cluster
lapply(reslist, tail, n=3)
## return the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x)*p)), p=0.2)
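The proportional selection described in the question (each cluster contributes according to its size) can be sketched by computing a quota per cluster and taking that many of the highest-scoring rows. This is only a minimal sketch: `n_total`, `sizes`, `quota` and `picked` are hypothetical names, the seed is fixed purely for reproducibility, and because the quotas are rounded they may not add up exactly to `n_total`.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

# cluster and aggregate dissimilarity per individual, as in the answer above
d   <- dist(mydata)
cl  <- cutree(hclust(d), k = 5)
res <- data.frame(id = rownames(mydata), cluster = cl,
                  dis_sim = rowSums(as.matrix(d)))

n_total <- 10  # hypothetical overall sample size (50 in the question)
sizes   <- table(res$cluster)
# per-cluster quota, proportional to cluster size (rounded)
quota   <- setNames(as.integer(round(n_total * sizes / nrow(res))), names(sizes))

# from each cluster, keep the quota[[k]] rows with the largest dis_sim
picked <- do.call(rbind, lapply(names(quota), function(k) {
  x <- res[res$cluster == as.integer(k), ]
  head(x[order(x$dis_sim, decreasing = TRUE), ], quota[[k]])
}))
picked
```

With the cluster sizes from the question (100, 200, 400, 200, 100) and `n_total <- 50`, the quotas come out as 5, 10, 20, 10 and 5.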

Regarding your comment, find the code below.
Please note that the code can be improved in terms of beauty and efficiency.
Further, I used a second answer, because otherwise it would be too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE))
cent <- tapply(mydata,
list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# add up cluster number and data and split by cluster
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5))
newdfl <- split(newdf, f=newdf$cluster)
# add centroids and drop cluster info
totaldf <- lapply(1:5,
function(i, li, cen) rbind(cen[i, ], li[[i]][ , -11]),
li=newdfl, cen=cent)
# calculate the distances to the centroids and sort them
dist_to_cent <- lapply(totaldf, function(x)
sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ]))
dist_to_cent
For the calculation of centroids from an hclust result, see the R-help mailing list post linked in the code above.
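The same distances to the cluster centroids can also be computed without binding the centroid onto each cluster's data. The following is a minimal sketch under the assumption that Euclidean distance is wanted; `grp`, `cent`, `d_cent` and `farthest` are hypothetical names and the seed is fixed only for reproducibility.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

grp <- cutree(hclust(dist(mydata)), k = 5)

# centroids: per-cluster column sums divided by the cluster sizes
cent <- rowsum(mydata, grp) / as.vector(table(grp))

# Euclidean distance of every individual to the centroid of its own cluster
d_cent <- sqrt(rowSums((mydata - cent[as.character(grp), ])^2))

# ids of the (up to) three individuals farthest from their centroid, per cluster
farthest <- tapply(d_cent, grp, function(v) head(names(sort(v, decreasing = TRUE)), 3))
farthest
```

Sorting `d_cent` in increasing order instead would give the individuals that are most typical of each cluster.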

Related

Clustering ranking

I'm analyzing data in R where predictor variables are available but there is no response variable. Using unsupervised learning (k-means) I have identified patterns in the data. But I need to rank the clusters according to their overall performance (example: students' data on exam marks and co-curricular marks). What technique do I use after clustering in R?
The cluster attribute of the kmeans output gives you the index of which cluster each data point is in. Example data taken from kmeans documentation:
nclusters = 5
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, nclusters, nstart = 25)
Now, your evaluation function (e.g. mean of column values) can be applied to each cluster individually:
for (i in 1:nclusters){
cat(i, apply(x[which(cl$cluster==i),],MARGIN=2,FUN=mean), '\n')
}
Or better still, use an aggregation function such as tapply or aggregate, e.g.:
aggregate(x, by=list(cluster=cl$cluster), FUN=mean)
which gives
  cluster          x          y
1       1  1.2468266  1.1499059
2       2 -0.2787117  0.0958023
3       3  0.5360855  1.0217910
4       4  1.0997776  0.7175210
5       5  0.2472313 -0.1193551
At this point you should be able to rank the values of the aggregation function as needed.
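As a minimal sketch of that last step, the aggregated values can be combined into one score per cluster and ranked. The unweighted mean of the per-variable cluster means used here is only an assumption for illustration (`score` and `rank` are hypothetical column names); any other weighting of the variables could be substituted.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, centers = 5, nstart = 25)

# per-cluster means of every variable
agg <- aggregate(x, by = list(cluster = cl$cluster), FUN = mean)

# hypothetical overall score: unweighted mean of the cluster means
agg$score <- rowMeans(agg[, c("x", "y")])
agg$rank  <- rank(-agg$score)  # rank 1 = highest score

agg[order(agg$rank), ]
```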

How can I find numerical intervals of k-means clusters?

I'm trying to discretize a numerical variable using Kmeans.
It worked pretty well but I'm wondering how I can find the intervals in my cluster.
I work with FactoMineR to do my kmeans.
I found 3 clusters according to the following graph:
My point now is to identify the intervals of my numerical variable within the clusters.
Is there any option or method in FactoMineR or another package to do it?
I can do it manually, but as I have to do it for a number of variables, I'd like to find an easy way to identify them.
Since you have not provided data, I have used the example from the kmeans documentation, which produces two groups for data with two columns x and y. You may split the original data by the cluster each row belongs to and then extract data from each group. I am not sure if my example data resembles your data, but in the code below I have simply used the min value of column x and the max value of column y (and their difference) as the boundaries of a potential interval (depending on the use case this makes sense or not). Does that help you?
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(data) <- c("x", "y")
cl <- kmeans(data, 2)
data <- as.data.frame(cbind(data, cluster = cl$cluster))
lapply(split(data, data$cluster), function(x) {
min_x <- min(x$x)
max_y <- max(x$y)
diff <- max_y-min_x
c(min_x = min_x , max_y = max_y, diff = diff)
})
# $`1`
# min_x max_y diff
# -0.6906124 0.5123950 1.2030074
#
# $`2`
# min_x max_y diff
# 0.2052112 1.6941800 1.4889688
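For the case in the question, a single numerical variable, the interval covered by each cluster is just the range of the values assigned to it. The sketch below uses base R's kmeans rather than FactoMineR, and made-up data; `v`, `km` and `intervals` are hypothetical names.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
# hypothetical numerical variable with three groups of values
v  <- c(rnorm(50, 0, 0.3), rnorm(50, 2, 0.3), rnorm(50, 4, 0.3))
km <- kmeans(v, centers = 3, nstart = 25)

# lower and upper bound of the values assigned to each cluster
intervals <- t(sapply(split(v, km$cluster), range))
colnames(intervals) <- c("lower", "upper")

# order the clusters by their lower bound for readability
intervals <- intervals[order(intervals[, "lower"]), ]
intervals
```

To repeat this for several variables, the same steps can be wrapped in a function and applied with lapply over the columns of a data frame.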

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like to have more widely spread data, so I could simply transform it by adding 50 and have a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of more or less meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
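The noise level does not have to be found by search. For B = a + b*A + noise, with independent noise of standard deviation sigma, the population correlation is sign(b) * |b|*sd(A) / sqrt(b^2 * var(A) + sigma^2), so solving for sigma gives sigma = |b| * sd(A) * sqrt(1/rho^2 - 1). A minimal sketch under that assumption (`rho`, `slope` and `sigma` are hypothetical names; the sample correlation will still vary slightly around the target):

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
A <- rnorm(10000, 50, 10)

rho   <- -0.78   # target correlation
slope <- -1e3    # same scaling as in the answer above

# noise sd that yields the target correlation in expectation
sigma <- abs(slope) * sd(A) * sqrt(1 / rho^2 - 1)

B <- 1e5 + slope * A + rnorm(length(A), mean = 0, sd = sigma)
cor(A, B)
```

For sd(A) close to 10 this gives a sigma of roughly 8,000, in line with the value found by parameter search.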
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by #gregor-de-cillia.
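The reason this works is that the Pearson correlation is unchanged by a linear transformation with a positive slope. A minimal sketch with made-up data; `rescale` is a hypothetical helper, not a function from any package, and the way `var2` is generated here is only a stand-in for the values produced in the question:

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
var1 <- rnorm(100, 50, 10)
# stand-in for a small-valued variable that is negatively correlated with var1
var2 <- -0.01 * var1 + rnorm(100, 0, 0.08)

# hypothetical helper: map v linearly onto the interval [lo, hi]
rescale <- function(v, lo, hi) {
  lo + (v - min(v)) / (max(v) - min(v)) * (hi - lo)
}

var2_scaled <- rescale(var2, 0, 1e5)

range(var2_scaled)       # now spans 0 to 100,000
cor(var1, var2)          # correlation before rescaling
cor(var1, var2_scaled)   # same correlation after rescaling
```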

How to import a distance matrix for clustering in R

I have got a text file containing 200 models all compared to each other, with a molecular distance for each pair of models compared. It looks like this:
1 2 1.2323
1 3 6.4862
1 4 4.4789
1 5 3.6476
.
.
All the way down to 200, where the first number is the first model, the second number is the second model, and the third number the corresponding molecular distance when these two models are compared.
I can't think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack and R but thanks in advance!
Since you don't have the distance between model1 and itself, you would need to insert that yourself, using the answer from this question:
(you can ignore the wrong numbering of the models compared to your input data, it doesn't serve a purpose, really)
# Create some dummy data that has the same shape as your data:
df <- expand.grid(model1 = 1:200, model2 = 2:200)
df$distance <- runif(n = 199*200, min = 1, max = 10)
head(df)
# model1 model2 distance
# 1 2 7.958746
# 2 2 1.083700
# 3 2 9.211113
# 4 2 5.544380
# 5 2 5.498215
# 6 2 1.520450
inds <- seq(0, 200*119, by = 200)
val <- c(df$distance, rep(0, length(inds)))
inds <- c(seq_along(df$distance), inds + 0.5)
val <- val[order(inds)]
Once that's in place, you can use matrix() with the ncol and nrow to "reshape" your vector of distance in the appropriate way:
matrix(val, ncol = 200, nrow = 200)
Edit:
When your data only contains the distance for one direction, so only between e.g. model1 - model5 and not model5 - model1, you will have to fill the values into one triangular part of a matrix, like they do here. Forget about the data I generated in the first part of this answer. Also, forget about adding the zeros to your distance column.
# start from zeros so that the distance of every model to itself is 0
dist_mat <- matrix(0, nrow = 200, ncol = 200)
# assumes the pairs are listed as (1,2), (1,3), ..., (2,3), ...;
# that is the column-wise order of the lower triangle
dist_mat[lower.tri(dist_mat)] <- your_data$distance
To copy the lower-triangular entries to above the diagonal, use:
dist_mat[upper.tri(dist_mat)] <- t(dist_mat)[upper.tri(dist_mat)]
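A variant that does not depend on the order in which the pairs are listed is to index the matrix by the model numbers directly. This is only a minimal sketch with a small made-up data set standing in for the file; the column names `model1`, `model2` and `distance` and the value of `n` are assumptions.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
n <- 5       # hypothetical number of models (200 in the question)

# stand-in for the imported file: one row per pair, each pair listed once
pairs <- t(combn(n, 2))
your_data <- data.frame(model1 = pairs[, 1], model2 = pairs[, 2],
                        distance = runif(nrow(pairs), min = 1, max = 10))

dist_mat <- matrix(0, nrow = n, ncol = n)   # zeros on the diagonal
idx <- as.matrix(your_data[, c("model1", "model2")])
dist_mat[idx]         <- your_data$distance  # entries (i, j)
dist_mat[idx[, 2:1]]  <- your_data$distance  # mirrored entries (j, i)

# hierarchical clustering on the imported distances
hc <- hclust(as.dist(dist_mat))
cutree(hc, k = 2)
```

as.dist() turns the symmetric matrix into the "dist" object that hclust() expects.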
As I do not know from your question what format your file is in, I will assume the most general file format, i.e., CSV.
Then you should look at the functions for reading files, read.csv or fread.
Example code:
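# `file` is the path to your text file; the sample shows no header row,
# so column names are assigned here (col1 and col2 are used below)
dt <- read.csv(file, sep = "", header = FALSE, col.names = c("col1", "col2", "distance"))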
I suggest using data.table package. Then:
setDT(dt)
dt[, id := paste0(as.character(col1), "-", as.character(col2))]
This creates a new variable from the first and second model numbers, which serves as a unique id.
I would then remove this id and scale the numerical input.
After scaling, run clustering algorithms.
Merge the result with the id to analyse your results.
Is that what you are looking for?
