cutree alternative to extract cluster with given number of objects - r

While stats::cutree() takes an hclust-object and cuts it into a given number of clusters, I'm looking for a function that takes a given amount of elements and attempts to set k accordingly. In other words: Return the first cluster with n elements.
For example:
Searching for the first cluster with n = 9 objects.
library(psych)
data(bfi)
x <- bfi
hclust.res <- hclust(dist(abs(cor(na.omit(x)))))
cutree.res <- cutree(hclust.res, k = 2)
cutree.table <- table(cutree.res)
cutree.table
# no cluster with n = 9 elements
> cutree.res
1 2
23 5
while k = 3 yields
cutree.res <- cutree(hclust.res, k = 3)
cutree.table <- table(cutree.res)
# three clusters, of which cluster 2 contains the required number of objects
> cutree.table
cutree.res
1 2 3
14 9 5
Is there a more convenient way than iterating over this?
Thanks

You can easily write code for this yourself that makes only one pass over the dendrogram rather than calling cutree in a loop.
Just execute the merges one by one and note the cluster sizes. Then keep the one that you "liked" the best.
Note that there might be no such solution. For example, on the 1-dimensional data set -11 -10 +10 +11, cutting the dendrogram in merge order will return clusters with 1, 2, or 4 elements only. So you'll have to handle this case, too.
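The single pass described above can be sketched as follows. This is a minimal illustration, not a tested replacement for cutree: the helper name first_cluster_of_size is hypothetical, and "first" here means first in merge order (bottom-up), which may differ from the first match found when increasing k. It relies on the merge component of an hclust object, where negative entries denote single observations and positive entries refer to earlier merge steps.

```r
# Hypothetical helper: walk the merges of an hclust object once and return
# the members of the first merged cluster that has exactly n elements.
first_cluster_of_size <- function(hc, n) {
  n_merges <- nrow(hc$merge)
  members <- vector("list", n_merges)
  for (i in seq_len(n_merges)) {
    # negative entry = single observation, positive entry = earlier merge
    parts <- lapply(hc$merge[i, ], function(j) {
      if (j < 0) -j else members[[j]]
    })
    members[[i]] <- unlist(parts)
    if (length(members[[i]]) == n) return(sort(members[[i]]))
  }
  NULL  # no cluster of that size ever appears
}

# The 1-dimensional example from above: only sizes 2 and 4 can be found.
hc <- hclust(dist(c(-11, -10, 10, 11)))
first_cluster_of_size(hc, 2)
first_cluster_of_size(hc, 3)  # NULL, this case has to be handled
```

Returning NULL makes the "no such solution" case explicit, so the caller can fall back to the closest size if desired.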

Related

Getting the biggest connected component in R igraph

How do I get a subgraph of the biggest component of a graph?
Say for example I have a graph g.
size_components_g <-clusters(g, mode="weak")$csize
size_components_g
#1 2 3 10 25 2 2 1
max_size <- max(size_components_g)
max_size
#25
So 25 is the biggest size.
I want to extract the component that has these 25 vertices. How do I do that?
The return value of any function in an R package is described in detail in its documentation. In this case igraph::clusters returns a named list in which csize stores the sizes of the clusters, while membership contains the id of the cluster to which each vertex belongs.
g <- igraph::sample_gnp(20, 1/20)
components <- igraph::clusters(g, mode="weak")
biggest_cluster_id <- which.max(components$csize)
# ids
vert_ids <- igraph::V(g)[components$membership == biggest_cluster_id]
# subgraph
igraph::induced_subgraph(g, vert_ids)

compare clusters' objects in R

I have two clustering results for the same variables but with different values each time. Let us create them with the following code:
set.seed(11)
a<-matrix(rnorm(10000),ncol=100)
colnames(a)<-(c(1:100))
set.seed(31)
b<-matrix(rnorm(10000),ncol=100)
colnames(b)<-colnames(a)
c.a<-hclust(dist(t(a)))
c.b<-hclust(dist(t(b)))
# clusters
groups.a<-cutree(c.a, k=15)
# take groups names
clus.a=list()
for (i in 1:15) clus.a[[i]] <- colnames(a)[groups.a==i]
# see the clusters
clus.a
groups.b<-cutree(c.b, k=15)
clus.b=list()
for (i in 1:15) clus.b[[i]] <- colnames(b)[groups.b==i]
# see the clusters
clus.b
What I get from that is two lists, clus.a and clus.b with the names (here just numbers from 1 to 100) of each cluster's variables.
Is there any way to examine if and which of the variables are clustered together in both clusterings? Meaning, how can I see if I have variables (could be teams of 2, 3, 4 etc) in same clusters for both clus.a and clus.b (doesn't have to be in the same cluster number).
If I understand your question correctly, you want to know if there are any clusters in a which have exactly the same membership as any of the clusters in b. Here's one way to do that.
Note: AFAICT in your example there are no matching clusters in a and b, so we create a few artificially to demo the solution.
# create artificial matches
clus.b[[3]] <- clus.a[[2]]
clus.b[[10]] <- clus.a[[8]]
clus.b[[15]] <- clus.a[[11]]
f <- function(a,b) (length(a)==length(b) & length(intersect(a,b))==length(a))
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
which(result, arr.ind=TRUE)
# row col
# [1,] 2 3
# [2,] 8 10
# [3,] 11 15
So this loops through all the clusters in b (sapply(clus.b,...)) and for each, loops through all the clusters in a looking for an exact match (in arbitrary order). For there to be a match, both clusters must have the same length, and the intersection of the two must contain all the elements in either - hence have the same length. This process produces a logical matrix with rows representing a and columns representing b.
Edit: To reflect the fact that OP is changing the question.
To detect clusters with two or more common elements, use:
f <- function(a,b) length(intersect(a,b))>1
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
matched <- which(result, arr.ind=TRUE)
matched
# row col
# [1,] 4 1
# [2,] 8 1
# [3,] 11 1
# [4,] 3 2
# ...
To identify which elements were present in both:
apply(matched,1,function(r) intersect(clus.a[[r[1]]],clus.b[[r[2]]]))
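As an alternative sketch (not part of the answer above), the same overlap counts can be read from a contingency table of the two cutree label vectors. Entry [i, j] is the number of variables that sit in cluster i of a and cluster j of b, so no explicit loop over the cluster lists is needed. The name overlap is only illustrative.

```r
# Recreate the two clusterings from the question.
set.seed(11)
a <- matrix(rnorm(10000), ncol = 100)
set.seed(31)
b <- matrix(rnorm(10000), ncol = 100)
groups.a <- cutree(hclust(dist(t(a))), k = 15)
groups.b <- cutree(hclust(dist(t(b))), k = 15)

# Cross-tabulate the labels: rows are clusters of a, columns clusters of b.
overlap <- table(a = groups.a, b = groups.b)

# Cluster pairs that share two or more variables.
which(overlap > 1, arr.ind = TRUE)
```

Cells equal to 0 or 1 correspond to pairs of clusters with no shared group of variables.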

how to write a loop of the number of for loops in R?

This is probably a simple one, but I somehow got stuck...
I need many loops to get the result for every sample in my support, like the usual nested loops:
for (a in 1:N1){
for (b in 1:N2){
for (c in 1:N3){
...
}
}
}
but the number of the for loops needed in this messy system depends on another random variable, let's say,
for(f in 1:N.for)
So how can I write a for loop to deal with this? Or are there more elegant ways to do this?
note that the difference is that the nested for loops above (the variables a,b,c,...) do matter in my calculations, but the variable f of the for loop that controls for the number of for loops needed does not go into any of my calculations for my real purpose - all it does is count/ensure the number of for loops needed is correct.
Did I make it clear?
So what I am actually trying to do is generate all the possible combinations of a number of people's preferences towards others.
Let's say I have 6 people (the simplest case for my purpose): Abi, Bob, Cath, Dan, Eva, Fay.
Abi and Bob have preference lists of C D E F ( 4!=24 possible permutations for each of them);
Cath and Dan have preference lists of A B and E F, respectively (2! * 2! = 4 possible permutations for each of them);
Eva and Fay have preference lists of A B C D (4!=24 possible permutations for each of them);
So altogether there should be 24*24*4*4*24*24 possible permutations of preferences when taking all six of them together.
I am just wondering what is a clear, easy and systematic way to generate them all at once?
I'd want them in the format such as
c.prefs <- as.matrix(data.frame(Abi = c("Eva", "Fay", "Dan", "Cath"),Bob = c("Dan", "Eva", "Fay", "Cath")))
but any clear format is fine...
Thank you so much!!
I'll assume you have a list of each loop variable and its maximum value, ordered from the outermost to innermost variable.
loops <- list(a=2, b=3, c=2)
You could create a data frame with all the loop variable values in the correct order with:
(indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len))))
# a b c
# 1 1 1 1
# 2 1 1 2
# 3 1 2 1
# 4 1 2 2
# 5 1 3 1
# 6 1 3 2
# 7 2 1 1
# 8 2 1 2
# 9 2 2 1
# 10 2 2 2
# 11 2 3 1
# 12 2 3 2
If the code run at the innermost point of the nested loop doesn't depend on the previous iterations, you could use something like apply to process each iteration independently. Otherwise you could loop through the rows of the data frame with a single loop:
for (i in seq_len(nrow(indices))) {
# You can get "a" with indices$a[i], "b" with indices$b[i], etc.
}
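When the loop body does not depend on earlier iterations, the row-wise processing mentioned above might look like the sketch below. The body (a plain sum of the loop variables) is a placeholder assumption, not the asker's actual calculation.

```r
# One entry per loop variable, outermost first; the length may vary at runtime.
loops <- list(a = 2, b = 3, c = 2)
indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len)))

# Placeholder body: each row holds one combination of a, b, c.
results <- apply(indices, 1, function(row) sum(row))

nrow(indices)  # 2 * 3 * 2 = 12 combinations
```

Adding a fourth entry to loops changes the number of simulated nested loops without touching the rest of the code.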
To do the calculation, one option is the Reduce function or some other higher-order function.
Since your data is not inherently ordered (each individual is a member of a set, and so are its preferences), I would keep the individuals in a factor and store, e.g., the preferences in lists named after the individuals. If you have large data you can store it in an environment.
The first part of the code only builds a reproducible example; the names follow graph terminology because the problem resembles a graph. To change the behaviour, you only need to adjust the first line and the runif call.
#people
verts <- factor(c(LETTERS[1:10]))
#relations, disallow preferring yourself
edges<-lapply(seq_along(verts), function(ind) {
levels(verts)[-ind]
})
names(edges) <- levels(verts)
#directions
#say you have these stored in a list or something
pool <- levels(verts)
directions<-lapply(pool, function(vert) {
relations <- pool[unique(round(runif(5, 1, 10)))]
# drop the person itself from its own preference list
relations[relations != vert]
})
names(directions) = pool
num_prefs <- (lapply(directions, length))
names(num_prefs) <- names(directions)
#First take factorial of each persons preferences,
#then reduce that with multiplication
combinations <-
Reduce(`*`,
sapply(num_prefs, factorial)
)
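Applied to the six-person example from the question, the same Reduce step gives the total directly. The vector name n_orderings is only illustrative; the per-person counts are the ones stated in the question.

```r
# Number of orderings per person in the six-person example.
# Cath and Dan each rank two separate pairs, hence 2! * 2!.
n_orderings <- c(
  Abi  = factorial(4),
  Bob  = factorial(4),
  Cath = factorial(2) * factorial(2),
  Dan  = factorial(2) * factorial(2),
  Eva  = factorial(4),
  Fay  = factorial(4)
)

# Multiply the per-person counts together: 24*24*4*4*24*24.
total <- Reduce(`*`, n_orderings)
total  # 5308416
```

With more than five million combinations, enumerating them all at once is already sizeable for this smallest case.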
I hope this answers your question!

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr)  # provides %>%, arrange(), mutate() and tibble()
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>%
arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA_character_)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
index_val <- k_means_df$index[i]
factor_val <- as.character(k_means_df$factors[i])
cluster_vals <- cluster_vals %>%
mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm. It is therefore expected, and correct, that the labels are neither consistent across runs nor sorted in "ascending" order.
But you can of course remap the labels afterwards as you like.
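Such a remapping can be sketched in a few lines. This assumes 1-dimensional data as in the question and orders the labels by decreasing center; the names new_label and ordered_cluster are illustrative, and nstart = 10 is an arbitrary choice to make the fit stable.

```r
x <- c(1.5, .2, 1.45, 1.4, .3, .3)
fit <- kmeans(x, 2, nstart = 10)

# new_label[old] is the rank of cluster `old` when the centers are sorted
# from largest to smallest.
new_label <- order(order(fit$centers[, 1], decreasing = TRUE))

# Relabel every point: the cluster with the largest center becomes 1.
ordered_cluster <- new_label[fit$cluster]
ordered_cluster
```

Whatever labels kmeans happened to assign, ordered_cluster uses 1 for the cluster with the largest center, 2 for the next, and so on.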
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.

count cycles in network

What is the best way, or are there any ways implemented in R, to count both 3- and 4-cycles in networks?
3-cycles are connected groups of three nodes (triangles), to be calculated from one-mode networks.
4-cycles are connected groups of four nodes (squares), to be calculated from two-mode networks.
If I have networks like this:
onemode <- read.table(text= "start end
1 2
1 3
4 5
4 6
5 6",header=TRUE)
twomode <- read.table(text= "typa typev
aa a
bb b
bb a
aa b",header=TRUE)
I thought
library(igraph)
g <- graph.data.frame(twomode)
E(g)
graph.motifs(g, size = 4)
would count the number of squares in my two-mode network, but I don't understand the output. I thought the result would be 1.
?graph.motifs
graph.motifs searches a graph for motifs of a given size and returns a
numeric vector containing the number of different motifs. The order of
the motifs is defined by their isomorphism class, see graph.isoclass.
So the output is a numeric vector in which each value is the count of one particular motif (of size 3 or 4) in your graph.
graph.motifs(g,size=4)
To get the total number of the motifs, you can use graph.motifs.no
graph.motifs.no(g,size=4)
[1] 1
This is the count of the motif at position 20 of that vector:
which(graph.motifs(g,size=4) >0)
[1] 20
Another function that might be easier to use for this task is kcycle.census {sna}. Details: http://svitsrv25.epfl.ch/R-doc/library/sna/html/path.census.html
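As a cross-check that needs no extra package, triangles and squares can also be counted from traces of powers of the adjacency matrix. This is a swapped-in technique, not what graph.motifs or kcycle.census do internally, and it assumes a simple undirected graph (no self-loops, no duplicate edges, node names unique across both modes). The helper names below are hypothetical.

```r
onemode <- read.table(text = "start end
1 2
1 3
4 5
4 6
5 6", header = TRUE)
twomode <- read.table(text = "typa typev
aa a
bb b
bb a
aa b", header = TRUE)

# Build a symmetric 0/1 adjacency matrix from a two-column edge list.
adjacency <- function(edges) {
  idx <- cbind(as.character(edges[[1]]), as.character(edges[[2]]))
  nodes <- unique(as.vector(idx))
  A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
  A[idx] <- 1
  A[idx[, 2:1, drop = FALSE]] <- 1
  A
}

# Each triangle yields 6 closed walks of length 3.
count_triangles_trace <- function(A) sum(diag(A %*% A %*% A)) / 6

# Closed walks of length 4, minus the degenerate back-and-forth walks;
# each square is then counted 8 times.
count_squares_trace <- function(A) {
  deg <- rowSums(A)
  A2 <- A %*% A
  (sum(diag(A2 %*% A2)) - 2 * sum(deg^2) + sum(deg)) / 8
}

count_triangles_trace(adjacency(onemode))  # 1
count_squares_trace(adjacency(twomode))    # 1
```

Dense matrix products scale poorly, so this is only practical for small networks.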
