I have two clustering results for the same variables, but based on different values each time. Let us create them with the following code:
set.seed(11)
a<-matrix(rnorm(10000),ncol=100)
colnames(a)<-(c(1:100))
set.seed(31)
b<-matrix(rnorm(10000),ncol=100)
colnames(b)<-colnames(a)
c.a<-hclust(dist(t(a)))
c.b<-hclust(dist(t(b)))
# clusters
groups.a<-cutree(c.a, k=15)
# collect the variable names in each cluster
clus.a <- list()
for (i in 1:15) clus.a[[i]] <- colnames(a)[groups.a==i]
# see the clusters
clus.a
groups.b<-cutree(c.b, k=15)
clus.b <- list()
for (i in 1:15) clus.b[[i]] <- colnames(b)[groups.b==i]
# see the clusters
clus.b
What I get from that is two lists, clus.a and clus.b, containing the names (here just the numbers 1 to 100) of each cluster's variables.
Is there any way to examine whether, and which of, the variables are clustered together in both clusterings? Meaning, how can I see whether there are variables (in groups of 2, 3, 4, etc.) that fall in the same cluster in both clus.a and clus.b (it doesn't have to be the same cluster number)?
If I understand your question correctly, you want to know whether there are any clusters in a which have exactly the same membership as any of the clusters in b. Here's one way to do that.
Note: AFAICT in your example there are no matching clusters in a and b, so we create a few artificially to demo the solution.
# create artificial matches
clus.b[[3]] <- clus.a[[2]]
clus.b[[10]] <- clus.a[[8]]
clus.b[[15]] <- clus.a[[11]]
f <- function(a,b) (length(a)==length(b) & length(intersect(a,b))==length(a))
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
which(result, arr.ind=TRUE)
# row col
# [1,] 2 3
# [2,] 8 10
# [3,] 11 15
So this loops through all the clusters in b (sapply(clus.b,...)) and, for each, loops through all the clusters in a looking for an exact match (element order does not matter). For there to be a match, both clusters must have the same length, and their intersection must contain every element of either one, hence have that same length. This produces a logical matrix whose rows represent the clusters in a and whose columns represent the clusters in b.
Edit: to reflect the OP's updated question.
To detect clusters with two or more common elements, use:
f <- function(a,b) length(intersect(a,b))>1
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
matched <- which(result, arr.ind=TRUE)
matched
# row col
# [1,] 4 1
# [2,] 8 1
# [3,] 11 1
# [4,] 3 2
# ...
To identify which elements were present in both:
apply(matched,1,function(r) intersect(clus.a[[r[1]]],clus.b[[r[2]]]))
While stats::cutree() takes an hclust object and cuts it into a given number of clusters, I'm looking for a function that takes a given number of elements and attempts to set k accordingly. In other words: return the first cut that produces a cluster with n elements.
For example:
Searching for the first cluster with n = 9 objects.
library(psych)
data(bfi)
x <- bfi
hclust.res <- hclust(dist(abs(cor(na.omit(x)))))
cutree.res <- cutree(hclust.res, k = 2)
cutree.table <- table(cutree.res)
cutree.table
# no cluster with n = 9 elements
> cutree.table
cutree.res
 1  2 
23  5 
while k = 3 yields
cutree.res <- cutree(hclust.res, k = 3)
# three clusters, whereas cluster 2 contains the required amount of objects
> cutree.table
cutree.res
 1  2  3 
14  9  5 
Is there a more convenient way than iterating over this?
Thanks
You can easily write code for this yourself that does only one pass over the dendrogram rather than calling cutree in a loop.
Just execute the merges one by one and note the cluster sizes, then keep the cut you like best.
Note that there might be no such solution. For example, on the 1-dimensional data set -11, -10, +10, +11, cutting the dendrogram in merge order will only ever return clusters with 1, 2, or 4 elements (never 3). So you'll have to handle this case, too.
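For illustration, here is a minimal sketch of that single pass (first_cluster_of_size is a hypothetical helper, not part of stats): it walks hc$merge, tracks the size of the cluster created at each merge, and cuts as soon as a cluster of the requested size (n >= 2) appears.
first_cluster_of_size <- function(hc, n) {
  N <- length(hc$order)        # number of leaves
  sizes <- integer(N - 1)      # size of the cluster formed at each merge
  for (i in seq_len(N - 1)) {
    left  <- hc$merge[i, 1]
    right <- hc$merge[i, 2]
    # negative entries are singleton leaves; positive ones index earlier merges
    sizes[i] <- (if (left  < 0) 1 else sizes[left]) +
                (if (right < 0) 1 else sizes[right])
    if (sizes[i] == n) {
      return(cutree(hc, k = N - i))  # after merge i there are N - i clusters
    }
  }
  NULL  # no cut produces a cluster with exactly n elements
}
# e.g. groups <- first_cluster_of_size(hclust.res, 9); table(groups)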
I am dealing with the problem that I need to count the unique names of people in a string, taking into consideration that there may be slight typos.
My thought was to treat strings below a certain threshold (e.g. a Levenshtein distance below 2) as equal. Right now I manage to calculate the string distances, but I am not managing to make the changes to my input string that would get me the correct number of unique names.
library(stringdist);library(stringr)
names<-"Michael, Liz, Miichael, Maria"
names_split<-strsplit(names, ", ")[[1]]
stringdistmatrix(names_split,names_split)
[,1] [,2] [,3] [,4]
[1,] 0 6 1 5
[2,] 6 0 7 4
[3,] 1 7 0 6
[4,] 5 4 6 0
(number_of_people<-str_count(names, ",")+1)
[1] 4
The correct value of number_of_people should be, of course, 3.
As I am only interested in the number of uniques names, I am not concerned if "Michael" becomes replaced by "Miichael" or the other way round.
One option is to try to cluster the names based on their distance matrix:
library(stringdist)
# create a 'dist' object (=lower triangular part of distance matrix)
d <- stringdistmatrix(names_split,method="osa")
# use hierarchical clustering to group nearest neighbors
hc <- hclust(d)
# visual inspection: y-axis labels the distance value
plot(hc)
# decide what distance value you find acceptable for grouping.
cutree(hc, h=3)
Depending on your actual data you will need to experiment with the distance type (q-gram or cosine distances may be useful, or the Jaro-Winkler distance in the case of names).
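To get the count the OP is after, it then suffices to count the distinct group labels. With the toy data above, a cut height of 3 merges the two "Michael" spellings while keeping "Liz" and "Maria" separate:
groups <- cutree(hc, h = 3)
(number_of_people <- length(unique(groups)))
# [1] 3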
Motivation: I am currently trying to rethink my coding so as to avoid for-loops where possible. The problem below can easily be solved with conventional for-loops, but I was wondering whether R offers a way to use the apply family to make it easier.
Problem: I have a matrix, say X (an n x k matrix), and two matrices of start and stop indices, called index.starts and index.stops, respectively. Both have B columns, and it holds that index.stops = index.starts + m for some integer m. Each pair index.starts[i,j] and index.stops[i,j] is used to subset X as X[index.starts[i,j]:index.stops[i,j], ], i.e., it should select all the rows of X in its index range.
Can I solve this problem using one of the apply functions?
Application: (Not necessarily important for understanding my problem.) In case you are interested, this is needed for a block-bootstrap application in a time-series setting. X represents the original sample. index.starts is sampled as replicate(repetitionNumber, sample.int((n-r), ceiling(n/r), replace=TRUE)) and index.stops is obtained as index.stops = index.starts + m. What I want in the end is a collection of rows of X; in particular, I want to resample, repetitionNumber times, ceiling(n/r) blocks of consecutive rows from X.
Example:
#generate data
n<-100 #the size of your sample
B<-5 #the number of columns for index.starts and index.stops
#and equivalently the number of block bootstraps to sample
k<-2 #the number of variables in X
X<-matrix(rnorm(n*k), nrow=n, ncol = k)
#take a random sample of the indices 1:100 to get index.starts
r<-10 #this is the block length
#get a sample of the indices 1:(n-r), and get ceiling(n/r) of these
#(for n=100 and r=10, ceiling(n/r) = n/r = 10). Replicate this B times
index.starts<-replicate(B, sample.int((n-r), ceiling(n/r), replace=TRUE))
index.stops<-index.starts + r
#Now can I use apply-functions to extract the r subsequent rows that are
#paired in index.starts[i,j] and index.stops[i,j] for i = 1,2,...,10 = ceiling(n/r) and
#j=1,2,3,4,5=B ?
It's probably more complicated than what you want/need, but here is a first approach. Just comment if it helps you in any way and I am happy to help further.
My approach uses (multiple) *apply functions. The first lapply "loops" over the 1:B cases; for each, it first draws the start and end points, which are combined into take.rows (the row indices used for subsetting). Next, the initial matrix is subset by take.rows (and the results are returned in a list). As a last step, the standard deviation is taken for each column of the subset matrices (as a dummy computation).
The code (with heavy commenting) looks like this:
# you can use lapply in parallel mode if you want to speed up code...
lapply(1:B, function(i){
starts <- sample.int((n-r), ceiling(n/r), replace=TRUE)
# [1] 64 22 84 26 40 7 66 12 25 15
ends <- starts + r
take.rows <- Map(":", starts, ends)
# [[1]]
# [1] 64 65 66 67 68 69 70 71 72 73 74
# ...
res <- lapply(take.rows, function(subs) X[subs, ])
# res is now a list of 10 with the ten subsets
# [[1]]
# [,1] [,2]
# [1,] 0.2658915 -0.18265235
# [2,] 1.7397478 0.66315385
# ...
# say you want to compute something (sd in this case) you can do the following
# but better you do the computing directly in the former "lapply(take.rows...)"
res2 <- t(sapply(res, function(tmp){
apply(tmp, 2, sd)
})) # simplify into a vector/data.frame
# [,1] [,2]
# [1,] 1.2345833 1.0927203
# [2,] 1.1838110 1.0767433
# [3,] 0.9808146 1.0522117
# ...
return(res2)
})
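If you would rather use the index.starts and index.stops matrices you already built, the same idea condenses to a couple of lines; a sketch, assuming you simply want the stacked block rows of each bootstrap replication:
# for each replication j, expand every (start, stop) pair into row indices
# and stack the corresponding blocks of X into one matrix
boot.samples <- lapply(seq_len(B), function(j) {
  rows <- unlist(Map(":", index.starts[, j], index.stops[, j]))
  X[rows, , drop = FALSE]
})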
Does that point you in the right direction / give you the answer?
Sorry if this has been posted before. I looked for the answer both on Google and Stackoverflow and couldn't find a solution.
Right now I have two matrices of data in R. I am trying to loop through each row in one matrix and find the row in the other matrix that is most similar by some distance metric (for now, squared Euclidean distance). I figured out one method, but it is O(n^2), which is prohibitive for my data.
I think this might be similar to some dictionary learning techniques but I couldn't find anything.
Thanks!
Both matrices are just 30 by n matrices with a number at each entry.
distance.fun <- function(mat1, mat2){
  match <- c()
  for (i in 1:nrow(mat1)){
    if (!all(is.na(mat1[i, ]))){
      dist <- c()
      for (j in 1:nrow(mat2)){
        dist[j] <- sum((mat1[i, ] - mat2[j, ])^2)
      }
      match[i] <- which.min(dist)  # closest row of mat2 for row i of mat1
    }
  }
  return(match)
}
A better strategy would be to compute the distance matrix all at once first, then extract the minima. Here's an example using simulated data
set.seed(15)
mat1<-matrix(runif(2*25), ncol=2)
mat2<-matrix(runif(2*25), ncol=2)
and here's a helper function that calculates the distances between the rows of one matrix and those of another. It uses the built-in dist function; it does perform unnecessary within-group comparisons that we eventually have to filter out, but it may still perform better overall.
distab<-function(m1, m2) {
stopifnot(ncol(m1)==ncol(m2))
m<-as.matrix(dist(rbind(m1, m2)))[1:nrow(m1), -(1:nrow(m1))]
rownames(m)<-rownames(m1)
colnames(m)<-rownames(m2)
m
}
mydist<-distab(mat1, mat2)
now that we have the between-group distances, we just need to find, for each row of mat1, the mat2 row at minimum distance. Since the rows of mydist correspond to mat1, we apply which.min over rows.
best <- apply(mydist, 1, which.min)
rr <- cbind(m1.row=seq.int(nrow(mat1)), best.m2.row = best)
head(rr) #just print a few
# m1.row best.m2.row
# [1,] 1 1
# [2,] 2 14
# [3,] 3 7
# [4,] 4 3
# [5,] 5 23
# [6,] 6 15
note that with a strategy like this (as well as with your original implementation) it is possible for multiple rows from mat1 to match the same row in mat2, and for some rows in mat2 to be unmatched to mat1.
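As a further sketch, the between-group distances can also be computed directly, avoiding the within-group comparisons that dist performs; cross_dist2 is a hypothetical helper based on the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b:
cross_dist2 <- function(m1, m2) {
  # nrow(m1) x nrow(m2) matrix of squared Euclidean distances
  outer(rowSums(m1^2), rowSums(m2^2), "+") - 2 * tcrossprod(m1, m2)
}
best2 <- apply(cross_dist2(mat1, mat2), 1, which.min)  # nearest mat2 row per mat1 row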
Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters labeled 1, 2, etc., in order of decreasing numerical value of their centers?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the clusters arbitrarily. I've looked into the documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups with labels following a logical ordinal order. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr)  # for data_frame, %>%, arrange, mutate
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
k_means_df <- data_frame(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>%
  arrange(centres)
Now that the centres are in the df in ascending order, I created my 5-element factor vector and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- data_frame(cluster=k5$cluster, factor=NA)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
index_val <- k_means_df$index[i]
factor_val <- as.character(k_means_df$factors[i])
cluster_vals <- cluster_vals %>%
mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voilà; I now have a vector of factors/characters that were applied, according to their ordinal logic, to the arbitrarily numbered cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm, so it is expected behavior that the labels are not consistent across runs, nor ordered by "ascending" center.
But you can of course remap the labels as you like.
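For the OP's concrete case, the remapping takes only a couple of lines; a minimal sketch:
x <- c(1.5, 1.4, 1.45, .2, .3, .3)
fit <- kmeans(x, 2)
ord <- order(fit$centers, decreasing = TRUE)  # cluster ids sorted by center, largest first
relabeled <- match(fit$cluster, ord)          # 1 = cluster with the largest center
relabeled
# [1] 1 1 1 2 2 2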
You seem to be using 1-dimensional data, and then k-means is actually not the best choice for you.
In contrast to data in two or more dimensions, 1-dimensional data can be sorted efficiently, so use an algorithm that exploits this. There are much better algorithms for 1-dimensional data than for multivariate data.
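For example, the Ckmeans.1d.dp package implements optimal dynamic-programming k-means for 1-dimensional data and, as I understand it, numbers the clusters in increasing order of their means; a sketch, assuming the package is installed:
library(Ckmeans.1d.dp)
x <- c(1.5, 1.4, 1.45, .2, .3, .3)
res <- Ckmeans.1d.dp(x, 2)
res$cluster                          # deterministic; labels ascend with center value
max(res$cluster) + 1L - res$cluster  # flip to get decreasing-center labels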