How to get member of clusters from R's hclust/heatmap.2 - r

I have the following code that perform hiearchical
clustering and plot them in heatmap.
library(gplots)
set.seed(538)
# generate data
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# the actual data is much larger that the above
# perform hiearchical clustering and plot heatmap
test <- heatmap.2(y)
Which plot this:
What I want to do is to get the cluster member from each hierarchy of in the plot
yielding:
Clust 1: g3-g2-g4
Clust 2: g2-g4
Clust 3: g4-g7
etc
Cluster last: g1-g2-g3-g4-g5-g6-g7-g8-g9-g10
Is there a way to do it?

I did have the answer, after all! #zkurtz identified the problem ... the data I was using were different than the data you were using. I added a set.seed(538) statement to your code to stabilize the data.
Use this code to create a matrix of cluster membership for the dendrogram of the rows using the following code:
cutree(as.hclust(test$rowDendrogram), 1:dim(y)[1])
This will give you:
1 2 3 4 5 6 7 8 9 10
g1 1 1 1 1 1 1 1 1 1 1
g2 1 2 2 2 2 2 2 2 2 2
g3 1 2 2 3 3 3 3 3 3 3
g4 1 2 2 2 2 2 2 2 2 4
g5 1 1 1 1 1 1 1 4 4 5
g6 1 2 3 4 4 4 4 5 5 6
g7 1 2 2 2 2 5 5 6 6 7
g8 1 2 3 4 5 6 6 7 7 8
g9 1 2 3 4 4 4 7 8 8 9
g10 1 2 3 4 5 6 6 7 9 10

This solution requires computing the cluster structure using a different packags:
# Generate data
y = matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# The new packags:
library(nnclust)
# Create the links between all pairs of points with
# squared euclidean distance less than threshold
links = nncluster(y, threshold = 2, fill = 1, give.up =1)
# Assign a cluster number to each point
clusters=clusterMember(links, outlier = FALSE)
# Display the points that are "alone" in their own cluster:
nas = which(is.na(clusters))
print(rownames(y)[nas])
clusters = clusters[-nas]
# For each cluster (with at least two points), display the included points
for(i in 1:max(clusters, na.rm = TRUE)) print(rownames(y)[clusters == i])
Obviously you would want to revise this into a function of some kind to be more user friendly. In particular, this gives the clusters at only one level of the dendrogram. To get the clusters at other levels, you would have to play with the threshold parameter.

Related

how to filter out small subgraphs in R

suppose I have a network like this with multiple subgraphs.
How can I only keep the subgraph with the most number of vertices while removing the rest? In this case I want to keep the subgraph on the left and remove the 3-vertices one the lower right. Thanks!
Given
set.seed(1)
g <- sample_gnp(20, 1 / 20)
plot(g)
we wish to keep the subgraph with 6 vertices. Using
(clu <- components(g))
# $membership
# [1] 1 2 3 4 5 4 5 5 6 7 8 9 10 3 5 11 5 3 12 5
# $csize
# [1] 1 1 3 2 6 1 1 1 1 1 1 1
# $no
# [1] 12
gMax <- induced_subgraph(g, V(g)[clu$membership == which.max(clu$csize)])
we then get
plot(gMax)
This assumes that there is a single largest connected subgraph. Otherwise the "first" one will be chosen.

R: How to create a equal blocks of variables randomly?

I have a data frame of n = 20 variables (number of columns) spread over b = 5 blocks (4 variables per block).
I would like to create p = 4 random and equal sized blocks of variables from the 5 blocks of variables.
I tried :
sample (x = 1: p, size = n, replace = TRUE)
[1] 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 4 4 4 4 4
Example of expected result (5 variables per block):
[1] 4 1 2 1 4 2 3 1 2 3 2 1 4 3 1 2 3 3 4 4
Thanks for your help !
You can try:
sample(x = rep(1:p,n/p), size = n, replace = FALSE)
Having discussed this in comments below, here is a solution:
Create a vector that looks like what you want, and then use sample to randomly sort it by sampling the whole vector without replacement:
p <- 4
b <- 5
sample(rep(1:p, b), size = p * b)
[1] 3 1 4 3 3 4 1 1 4 2 2 4 3 2 1 2 2 4 3 1

Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that look like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is there a way to run a command outside of a loop to get the data to place NA's where there is no data. So I would like for my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new colum, this time with a vector with less elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled, into the data.frame, you do not get and error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, probably you have used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiply
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

Visualize strongly connected components in R

I have a weighted directed graph with three strongly connected components(SCC).
The SCCs are obtained from the igraph::clusters function
library(igraph)
SCC<- clusters(graph, mode="strong")
SCC$membership
[1] 9 2 7 7 8 2 6 2 2 5 2 2 2 2 2 1 2 4 2 2 2 3 2 2 2 2 2 2 2 2
SCC$csize
[1] 1 21 1 1 1 1 2 1 1
SCC$no
[1] 9
I want to visualize the SCCs with circles and a colored background as the graph below, is there any ways to do this in R? Thanks!
Take a look at the mark.groups argument of plot.igraph. Something like the following will do the trick:
# Create some toy data
set.seed(1)
library(igraph)
graph <- erdos.renyi.game(20, 1/20)
# Do the clustering
SCC <- clusters(graph, mode="strong")
# Add colours and use the mark.group argument
V(graph)$color <- rainbow(SCC$no)[SCC$membership]
plot(graph, mark.groups = split(1:vcount(graph), SCC$membership))

Resources