Interpreting the result of 'cutree' from hclust/heatmap.2 - r

I have the following code that performs hierarchical clustering and plots
the result as a heatmap.
library(gplots)  # provides heatmap.2
set.seed(538)
# generate data
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""),
paste("t", 1:5, sep="")))
# the actual data is much larger than the above
# perform hierarchical clustering and plot heatmap
test <- heatmap.2(y)
What I want to do is print the cluster members at each level of the
hierarchy shown in the plot. I'm not sure what a good way to do this is.
I tried this:
cutree(as.hclust(test$rowDendrogram), 1:dim(y)[1])
But I am having trouble interpreting the result.
What is the meaning of each value in the matrix?
For example, the entry in row g9, column 9 is 8. What does 8 mean here?
    1 2 3 4 5 6 7 8 9 10
g1  1 1 1 1 1 1 1 1 1  1
g2  1 2 2 2 2 2 2 2 2  2
g3  1 2 2 3 3 3 3 3 3  3
g4  1 2 2 2 2 2 2 2 2  4
g5  1 1 1 1 1 1 1 4 4  5
g6  1 2 3 4 4 4 4 5 5  6
g7  1 2 2 2 2 5 5 6 6  7
g8  1 2 3 4 5 6 6 7 7  8
g9  1 2 3 4 4 4 7 8 8  9
g10 1 2 3 4 5 6 6 7 9 10
Your expert advice will be greatly appreciated.

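Column j gives the group assignment of each g when the tree is cut into exactly j groups. So the 8 in row g9, column 9 means that g9 belongs to group number 8 when you ask for 9 groups. The numbers are only labels, assigned in the order in which the groups first appear going down the rows.
Columns 1 and 10 are not very useful (everything in one group, or every g in its own group), but column 2 is a good example. It tells you that if you wanted exactly two groups they would be:
group1: {g1, g5}
group2: {g2, g3, g4, g6, g7, g8, g9, g10}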
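To print the members of every group rather than read them off the matrix, something along these lines might work. It is only a sketch: it assumes that hclust(dist(y)) reproduces the row clustering heatmap.2 does with its default settings, and the names memb, groups_k2 and all_groups are made up for the illustration.

```r
# Assumption: hclust(dist(y)) stands in for the default row clustering
# of heatmap.2, so gplots is not needed for this sketch.
set.seed(538)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste0("g", 1:10), paste0("t", 1:5)))
hc <- hclust(dist(y))

# One column per requested number of groups (k = 1, ..., 10)
memb <- cutree(hc, k = 1:nrow(y))

# Members of each group when the tree is cut into exactly two groups
groups_k2 <- split(rownames(memb), memb[, "2"])
print(groups_k2)

# The same listing for every k, stored as a list of lists
all_groups <- lapply(colnames(memb),
                     function(k) split(rownames(memb), memb[, k]))
names(all_groups) <- colnames(memb)
```

With the heatmap object from the question, as.hclust(test$rowDendrogram) could be used in place of hc.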

Related

merge/join two long df in R

I have two dataframes a and b which I would like to combine
a <- data.frame(g=c("1","2","2","3","3","3","4","4","4","4"),h=c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g=c("1","2","3","3","3","4","4","4","4","4"),i=c("1","2","3","2","1","2","3","4","5","6"))
g represents a grouping variable and h and i the columns I want to merge/join
> a
g h
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
9 4 3
10 4 4
> b
g i
1 1 1
2 2 2
3 3 3
4 3 2
5 3 1
6 4 2
7 4 3
8 4 4
9 4 5
10 4 6
a and b should be merged within each level of the grouping variable g: identical values of h and i should be placed on the same row (independent of the order in which they appear in h/i), and values without a match should appear once, with NA on the other side (not in all possible combinations).
a final df would look like:
g h i
1 1 1 1
2 2 1 <NA>
3 2 2 2
4 3 1 1
5 3 2 2
6 3 3 3
7 4 1 <NA>
8 4 2 2
9 4 3 3
10 4 4 4
11 4 <NA> 5
12 4 <NA> 6
I need that df to perform a correlation analysis.
Sounds like a merge on h==i while retaining both h and i, so create a new variable x to join on, and keep the unmatched rows from both sides (all=TRUE). With a large hat-tip to @Moody_Mudskipper:
merge(transform(a,x=h), transform(b,x=i), all=TRUE)
# g x h i
#1 1 1 1 1
#2 2 1 1 <NA>
#3 2 2 2 2
#4 3 1 1 1
#5 3 2 2 2
#6 3 3 3 3
#7 4 1 1 <NA>
#8 4 2 2 2
#9 4 3 3 3
#10 4 4 4 4
#11 4 5 <NA> 5
#12 4 6 <NA> 6
We can also do this with dplyr:
library(dplyr)
a %>%
  mutate(x = h) %>%
  full_join(mutate(b, x = i), by = c("g", "x")) %>%
  select(-x)
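If you would rather not rely on merge picking up the shared columns implicitly, the join keys can be spelled out and the helper column dropped afterwards. This is a minimal base R sketch of the same idea; the name res is only for illustration.

```r
a <- data.frame(g = c("1","2","2","3","3","3","4","4","4","4"),
                h = c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g = c("1","2","3","3","3","4","4","4","4","4"),
                i = c("1","2","3","2","1","2","3","4","5","6"))

# Join on the group g and on the helper key x (a copy of h or i);
# all = TRUE keeps rows that have no partner on the other side.
res <- merge(transform(a, x = h), transform(b, x = i),
             by = c("g", "x"), all = TRUE)

# Drop the helper key so only g, h and i remain
res <- res[, c("g", "h", "i")]
print(res)
```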

Compute degree of each vertex from data frame

I have the following dataset:
V1 V2
2 1
3 1
3 2
4 1
4 2
4 3
5 1
6 1
7 1
7 5
7 6
I tried to compute the degree of each vertex with the code
e<-read.table("ex.txt")
library(igraph)
g1<-graph.data.frame(e, directed=FALSE)
adj<- get.adjacency(g1,type=c("both", "upper", "lower"),attr=NULL, names=TRUE, sparse=FALSE)
d<-rowSums(adj)
e$degreeOfV1<-d[e$V1]
e$degofV2<-d[e$V2]
The degrees given by this code are not correct.
The problem with this code is that the nodes have been added to your graph in a different order than you expected:
V(g1)
# + 7/7 vertices, named:
# [1] 2 3 4 5 6 7 1
The first node in the graph (corresponding to element 1 of your d object) is actually node number 2 in e, element 2 is node number 3 in e, etc.
You can deal with this by using the node names instead of the node numbers when calculating the degrees:
d <- degree(g1)
e$degreeOfV1 <- d[as.character(e$V1)]
e$degreeOfV2 <- d[as.character(e$V2)]
# V1 V2 degreeOfV1 degreeOfV2
# 1 2 1 3 6
# 2 3 1 3 6
# 3 3 2 3 3
# 4 4 1 3 6
# 5 4 2 3 3
# 6 4 3 3 3
# 7 5 1 2 6
# 8 6 1 2 6
# 9 7 1 3 6
# 10 7 5 3 2
# 11 7 6 3 2
Basically the way this works is that degree(g1) returns a named vector of the degrees of each node in your graph:
(d <- degree(g1))
# 2 3 4 5 6 7 1
# 3 3 3 2 2 3 6
When you index by strings (as.character(e$V1) instead of e$V1), then you get the node by name instead of by index number.
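As a cross-check that does not need igraph at all, the degrees can also be counted directly from the edge list, again looking them up by name. This is only a sketch, assuming a simple undirected graph with no repeated edges; the name deg is made up here.

```r
e <- data.frame(V1 = c(2, 3, 3, 4, 4, 4, 5, 6, 7, 7, 7),
                V2 = c(1, 1, 2, 1, 2, 3, 1, 1, 1, 5, 6))

# Every appearance of a node in either column is one incident edge
deg <- table(c(e$V1, e$V2))

# Look up by name (character), not by position
e$degreeOfV1 <- as.vector(deg[as.character(e$V1)])
e$degreeOfV2 <- as.vector(deg[as.character(e$V2)])
print(e)
```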

Randomly Assign Integers in R within groups without replacement

I am running a study with two experiments: experiment_1 and experiment_2. Each experiment has 5 different treatments (i.e. 1, 2, 3, 4, 5). We are trying to randomly assign the treatments within groups.
We would like to do this by iteratively sampling without replacement within each group. We want to do this to ensure that we get as balanced a sample as possible across the treatments (e.g. we don't want to end up with 4 subjects in group 1 being assigned to treatment 2 and no one getting treatment 1). So if a group has 23 subjects, we want to split the respondents into 4 subgroups of 5 and 1 subgroup of 3. We then want to randomly sample without replacement across the first subgroup of 5, so that everyone is assigned 1 of the treatments, do the same thing for the second, third and fourth subgroups of 5, and for the final subgroup of 3 randomly sample without replacement. That way we would guarantee that every treatment is assigned to at least 4 subjects, and 3 of the treatments are assigned to 5 subjects within this group. We would like to do this for all the groups in the experiment and for both experiments. The resulting output would look something like this...
group experiment_1 experiment_2
[1,] 1 5 3
[2,] 1 3 2
[3,] 1 4 4
[4,] 1 1 5
[5,] 1 2 1
[6,] 1 2 3
[7,] 1 4 1
[8,] 1 3 2
[9,] 2 5 5
[10,] 2 1 4
[11,] 2 3 4
[12,] 2 1 5
[13,] 2 2 1
. . . .
. . . .
. . . .
I know how to use the sample function, but am unsure how to sample without replacement within each group, so that our output corresponds to above described procedure. Any help would be appreciated.
I think we just need to shuffle sample IDs, see this example:
set.seed(124)
#prepare groups and samples(shuffled)
df <- data.frame(group=sort(rep(1:3,9)),
sampleID=sample(1:27,27))
#treatments repeated nrow of df
df$ex1 <- rep(c(1,2,3,4,5),ceiling(nrow(df)/5))[1:nrow(df)]
df$ex2 <- rep(c(2,3,4,5,1),ceiling(nrow(df)/5))[1:nrow(df)]
df <- df[ order(df$group,df$sampleID),]
#check treatment distribution
with(df,table(group,ex1))
# ex1
# group 1 2 3 4 5
# 1 2 2 2 2 1
# 2 2 2 2 1 2
# 3 2 2 1 2 2
with(df,table(group,ex2))
# ex2
# group 1 2 3 4 5
# 1 1 2 2 2 2
# 2 2 2 2 2 1
# 3 2 2 2 1 2
How about this function:
f <- function(n,m) {sample( c( rep(1:m,n%/%m), sample(1:m,n%%m) ), n )}
"n" is the group size, "m" the number of treatments.
Each treatment must occur at least "n %/% m" times in the group.
The treatment numbers of the remaining "n %% m" group members are
assigned at random without repetition.
The vector "c( rep(1:m,n%/%m), sample(1:m,n%%m) )" contains these treatment numbers. Finally, the outer "sample" call
permutes these numbers.
> f(8,5)
[1] 5 3 1 5 4 2 2 1
> f(8,5)
[1] 4 5 3 4 2 2 1 1
> f(8,5)
[1] 4 2 1 5 3 5 2 3
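One caveat with f as written, as far as I can tell: for a group of size 1 the inner vector is a single number k, and sample(k, 1) then draws from 1:k instead of returning k. A possible variant that sidesteps this by shuffling positions with sample.int is sketched below; the name assign_treatments is made up for this illustration.

```r
# n = group size, m = number of treatments (same meaning as in f)
assign_treatments <- function(n, m) {
  # Full rounds of every treatment, plus a random subset for the remainder
  pool <- c(rep(seq_len(m), n %/% m), sample.int(m, n %% m))
  # Shuffle by permuting positions, which is safe even when n is 1
  pool[sample.int(n)]
}

x <- assign_treatments(23, 5)
print(table(x))
```

For a group of 23 this should give each treatment 4 or 5 times, with exactly three treatments appearing 5 times.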
Here is a function that creates a dataframe, using the above function:
Plan <- function( groupSizes, numExp=2, numTreatment=5 )
{
numGroups <- length(groupSizes)
df <- data.frame( group = rep(1:numGroups,groupSizes) )
for ( e in 1:numExp )
{
df <- cbind(df,unlist(lapply(groupSizes,function(n){f(n,numTreatment)})))
colnames(df)[e+1] <- sprintf("Exp_%i", e)
}
return(df)
}
Example:
> P <- Plan(c(8,23,13,19))
> P
group Exp_1 Exp_2
1 1 4 1
2 1 1 4
3 1 2 2
4 1 2 1
5 1 3 5
6 1 5 5
7 1 1 2
8 1 3 3
9 2 5 1
10 2 2 1
11 2 5 2
12 2 1 2
13 2 2 1
14 2 1 4
15 2 3 5
16 2 5 3
17 2 2 4
18 2 5 4
19 2 2 5
20 2 1 1
21 2 4 2
22 2 3 3
23 2 4 3
24 2 2 5
25 2 3 3
26 2 5 2
27 2 1 5
28 2 3 4
29 2 4 4
30 2 4 2
31 2 4 3
32 3 2 5
33 3 5 3
34 3 5 1
35 3 5 1
36 3 2 5
37 3 4 4
38 3 1 4
39 3 3 2
40 3 3 2
41 3 3 3
42 3 1 1
43 3 4 2
44 3 4 4
45 4 5 1
46 4 3 1
47 4 1 2
48 4 1 5
49 4 3 3
50 4 3 1
51 4 4 5
52 4 2 4
53 4 5 3
54 4 2 1
55 4 4 2
56 4 2 5
57 4 4 4
58 4 5 3
59 4 5 4
60 4 1 2
61 4 2 5
62 4 3 2
63 4 4 4
Check the distribution:
> with(P,table(group,Exp_1))
Exp_1
group 1 2 3 4 5
1 2 2 2 1 1
2 4 5 4 5 5
3 2 2 3 3 3
4 3 4 4 4 4
> with(P,table(group,Exp_2))
Exp_2
group 1 2 3 4 5
1 2 2 1 1 2
2 4 5 5 5 4
3 3 3 2 3 2
4 4 4 3 4 4
>
The design of efficient experiments is a science on its own and there are a few R-packages dealing with this issue:
https://cran.r-project.org/web/views/ExperimentalDesign.html
I am afraid your approach is not optimal regarding the resources, no matter how you create the samples...
However this might help:
n <- 23
group <- sort(rep(1:5, ceiling(n/5)))[1:n]
exp1 <- rep(NA, length(group))
for(i in 1:max(group)) {
exp1[which(group == i)] <- sample(1:5)[1:sum(group == i)]
}
Not exactly sure if this meets all your constraints, but you could use the randomizr package:
library(randomizr)
experiment_1 <- complete_ra(N = 23, num_arms = 5)
experiment_2 <- block_ra(experiment_1, num_arms = 5)
table(experiment_1)
table(experiment_2)
table(experiment_1, experiment_2)
Produces output like this:
> table(experiment_1)
experiment_1
T1 T2 T3 T4 T5
4 5 5 4 5
> table(experiment_2)
experiment_2
T1 T2 T3 T4 T5
6 3 6 4 4
> table(experiment_1, experiment_2)
experiment_2
experiment_1 T1 T2 T3 T4 T5
T1 2 0 1 1 0
T2 1 1 1 1 1
T3 1 1 1 1 1
T4 1 0 2 0 1
T5 1 1 1 1 1

Combine minimum values of row and column in matrix

Suppose I have a vector of size n=8, v=(5,8,2,7,9,12,2,1). I would like to know how to build an n x n matrix that compares every pair of values of v and returns the minimum of each pair. In this example, it would look like this:
5 5 2 5 5 5 2 1
5 8 2 7 8 8 2 1
2 2 2 2 2 2 2 1
5 7 2 7 7 7 2 1
5 8 2 7 9 9 2 1
5 8 2 7 9 12 2 1
2 2 2 2 2 2 2 1
1 1 1 1 1 1 1 1
Could you help me with this, please?
outer(v, v, pmin)
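Note the use of pmin rather than min: outer calls the function once on two long vectors holding every pairing and expects a result of the same length, so it needs the element-wise (vectorised) pmin. min would collapse everything to a single number.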
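A small check of both points, using the vector from the question. The names m and min_fails are only for illustration.

```r
v <- c(5, 8, 2, 7, 9, 12, 2, 1)

# Element-wise minimum of every pair (v[i], v[j])
m <- outer(v, v, pmin)
print(m)

# min returns a single value, which outer cannot reshape into 8 x 8,
# so this call is expected to stop with an error
min_fails <- inherits(try(outer(v, v, min), silent = TRUE), "try-error")
print(min_fails)
```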

Multiple plot by group by one function

I have the following data:
Animal MY Age
1 17.03672067 1
1 17.00833641 2
1 16.97995215 3
1 16.95156788 4
1 16.92318362 5
1 16.88157748 6
2 16.83997133 2
2 16.79836519 3
2 16.75675905 4
2 16.7151529 5
2 16.67354676 6
2 16.63194062 7
3 16.59033447 1
3 16.54872833 2
3 16.50712219 3
3 16.46551604 4
3 16.4239099 5
3 16.38230376 6
4 16.34069761 1
4 16.29909147 2
4 16.25748533 3
4 16.21587918 4
4 16.17427304 5
4 16.1326669 6
I want to draw a scatter plot of MY vs Age for each animal. I use this call:
plot(memo$MY[memo$Animal=="1223100747"]~memo$Age[memo$Animal=="1223100747"])
If I now want to add the same plot (MY vs Age) for another animal, I just need to use the lines function.
However, since I have about 200 animals, I do not want to do this manually for each one. My question is: how can I plot all these animals with one function call, instead of using lines, lines, ... lines?
Regards,
Phuong
You can use by, for example:
by(memo,memo$Animal,FUN=function(x) plot(x$MY~x$Age))
You could use a loop or a matplot if you want to use base R, but I advise you to use package ggplot2.
DF <- read.table(text="Animal MY Age
1 17.03672067 1
1 17.00833641 2
1 16.97995215 3
1 16.95156788 4
1 16.92318362 5
1 16.88157748 6
2 16.83997133 2
2 16.79836519 3
2 16.75675905 4
2 16.7151529 5
2 16.67354676 6
2 16.63194062 7
3 16.59033447 1
3 16.54872833 2
3 16.50712219 3
3 16.46551604 4
3 16.4239099 5
3 16.38230376 6
4 16.34069761 1
4 16.29909147 2
4 16.25748533 3
4 16.21587918 4
4 16.17427304 5
4 16.1326669 6",header=TRUE)
library(ggplot2)
DF$Animal <- factor(DF$Animal)
p1 <- ggplot(DF,aes(x=Age,y=MY,colour=Animal)) + geom_line()
print(p1)
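For completeness, the base R loop mentioned above could look roughly like this. It is a sketch on a few rows of the data from the question (the first three records of animals 1 and 2); the names memo_small and by_animal are made up here.

```r
memo_small <- data.frame(
  Animal = rep(1:2, each = 3),
  MY     = c(17.03672067, 17.00833641, 16.97995215,
             16.83997133, 16.79836519, 16.75675905),
  Age    = c(1, 2, 3, 2, 3, 4)
)

# One data frame per animal
by_animal <- split(memo_small, memo_small$Animal)

# Empty frame covering all animals, then one line per animal
plot(range(memo_small$Age), range(memo_small$MY), type = "n",
     xlab = "Age", ylab = "MY")
for (k in seq_along(by_animal)) {
  d <- by_animal[[k]]
  lines(d$Age, d$MY, type = "b", col = k)
}
legend("topright", legend = names(by_animal),
       col = seq_along(by_animal), lty = 1, title = "Animal")
```

With around 200 animals the legend gets unwieldy, so it may be better to leave it out or to plot the animals in batches.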
