Plotting Minimum Distance within clusters and linkage distance between clusters - r

I am using hierarchical clustering with the average-linkage method and Euclidean distance. To choose the number of clusters (k) at which to cut the tree, I need two plots: the minimum distance within clusters against the number of clusters (graph 1), and the linkage distance between clusters against the number of clusters (graph 2).
> df
Site1 Site2 Site3 Site4 Site5 Site6
1985 11 0 5 15 13 15
1986 12 12 5 31 14 26
1987 23 21 17 14 25 12
1988 22 25 18 17 24 14
1989 11 16 8 18 13 19
1990 7 5 21 8 9 24
1991 20 13 9 21 22 7
1992 15 11 6 19 17 20
1993 19 18 9 11 21 11
1994 33 9 28 17 26 20
1995 16 14 19 33 17 10
1996 14 21 25 4 6 47
1997 4 0 11 22 14 16
1998 10 31 13 26 12 14
1999 24 17 18 41 19 20
2000 21 17 23 19 23 14
2001 12 8 6 7 19 20
2002 19 24 19 31 24 17
2003 13 29 10 28 7 9
2004 19 14 19 22 20 13
2005 16 8 9 10 11 13
2006 8 9 46 9 20 19
2007 12 10 15 13 10 9
2008 12 18 25 12 47 22
2009 19 18 18 23 21 20
2010 23 10 46 35 25 12
2011 20 35 18 30 22 18
2012 23 13 23 34 25 34
2013 17 28 20 13 19 21
2014 19 22 16 16 21 23
df2 <- data.frame(t(df))
tree <- hclust(dist(df2))

Since no question is stated, I'm assuming that you want to reproduce the figure above with the example data set. Please correct me if that assumption is wrong.
(i) Find the number of groups for a sequence of linkage distances. The range of linkage distances used here (40 to 100) was eyeballed from plot(tree):
library(dplyr)
cls.df <- data.frame(h=40:100)
cls.df$k <- sapply(cls.df$h, function(x) cutree(tree, h=x) %>% max )
(ii) Clean the table by keeping only the minimum linkage distance h for each number of groups k:
cls.df <- cls.df %>%
group_by(k) %>%
summarise(h=min(h))
(iii) Plot:
library(ggplot2)
ggplot(cls.df, aes(k, h)) +
geom_line() +
geom_point() +
theme_bw() +
ylab("Linkage Distance") +
xlab("Number of Cluster")
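The height range in step (i) was chosen by eye. As a rough alternative sketch, the same k-versus-h table can be read from the merge heights stored in the tree, so no range has to be guessed. The built-in USArrests data and method = "average" are stand-ins chosen for this illustration, not the data or call from the question, and `cuts` is a hypothetical name.

```r
# Stand-in data: any numeric matrix or data frame works here
tree <- hclust(dist(USArrests), method = "average")

# Every distinct merge height is the smallest h at which the cluster
# count drops, so cutting exactly at those heights covers every k.
h <- unique(tree$height)
k <- sapply(h, function(x) max(cutree(tree, h = x)))

# Same layout as cls.df above: one row per number of clusters
cuts <- data.frame(k = k, h = h)
head(cuts)
```

Passing `cuts` to the ggplot call above in place of cls.df should draw the same kind of curve.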

Related

Generating a vector with n repetitions of x, then y, then z, with a fixed upper bound

I am trying to create a vector where I have 3 repetitions of the number 1, then 3 repetitions of the number 2, and so on up to, for instance, 3 repetitions of the number 36.
c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5...)
I have tried the following use of rep() but got the following error:
Error in rep(3, seq(1:36)) : argument 'times' incorrect
What formulation do I need to use to properly generate the vector I want?
sort(rep(1:36, 3))
Or, even better, as @Wimpel mentioned in the comments, use the each argument of the rep function:
rep(1:36, each = 3)
output
# [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 21 22
# [65] 22 22 23 23 23 24 24 24 25 25 25 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34 34 34 35 35 35 36 36 36
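The error in the question comes from rep(3, seq(1:36)): the second positional argument is times, which must be either a single number or one count per element of x, and here x has length 1. A small sketch of how each, times and length.out differ (the variable names are only for illustration):

```r
# each: repeat every element in place before moving on to the next one
by_each <- rep(1:3, each = 2)                 # 1 1 2 2 3 3

# times (single value): repeat the whole vector
by_times <- rep(1:3, times = 2)               # 1 2 3 1 2 3

# times (one count per element): same result as each = 3 here
by_counts <- rep(1:3, times = c(3, 3, 3))     # 1 1 1 2 2 2 3 3 3

# length.out truncates the result to a fixed length
capped <- rep(1:36, each = 3, length.out = 10)  # 1 1 1 2 2 2 3 3 3 4
```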
This one should work, although it is probably not the most elegant:
reps = c()
n = 36
for(i in 1:n){
reps = append(reps, rep(i, 3))
}
reps
Alternatively, use the rep function properly (see ?rep for the each argument):
rep(1:36,each = 3)
The rep approach is preferable (see the existing answers).
Here are some other options:
> kronecker(1:36, rep(1, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36
> c(outer(rep(1, 3), 1:36))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36

which.min() returning two numbers

I need the position of the smallest value in my vector (node degrees in a graph, obtained from the degree() function), so I use which.min().
However, because the vector is named, I get two values: the name of the node and its position in the vector (I have no idea why the names are not in order). Here node "23" has the smallest degree and sits at the 40th position in the vector. The two values are printed one above the other and I cannot figure out how to separate them.
I need just the name of the node for further use. I couldn't find any existing question about this issue.
> degs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 25 26 27 28 29 30 31 32 34 35 36 38 39 40 41 33 23 37 42 43
14 25 31 17 25 11 26 21 23 25 24 17 13 20 12 15 7 15 28 18 9 17 8 7 7 7 14 19 12 17 19 10 19 20 19 10 7 11 12 6 8 12 13
> which.min(degs)
23
40
The top number is just the name of the element, and you can ignore it. Compare:
> c("23" = 40)
23
40
If you want just the name of the node, you can use
names(which.min(degs))
Output will be "23".
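To make the name/position split concrete, here is a minimal sketch on a small made-up named vector (the values are hypothetical, not the degrees from the question). It also shows one way to get every node that ties for the minimum, since which.min() reports only the first one.

```r
# Hypothetical degree vector: names are node labels, values are degrees
degs <- c("1" = 14, "2" = 25, "23" = 6, "37" = 8, "40" = 6)

pos  <- unname(which.min(degs))   # position only: 3
node <- names(which.min(degs))    # node label only: "23"

# All nodes sharing the smallest degree, not just the first
all_min <- names(degs)[degs == min(degs)]   # "23" "40"
```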

Display vector in R with a defined viewport

I want to display a vector consistently across different R environments.
For example, a vector like this
c(1:30)
should display 24 values per row
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30
and not
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
The closest thing to what you are looking for is to use options() to configure the width of the results window:
options(width = 75)
c(1:30)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30
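As a sketch of how to hit the 24-per-row layout from the question: for 1:30 the index prefix takes four characters and each value three, so by that arithmetic a width between 76 and 78 should fit exactly 24 values per line. The snippet also restores the previous setting afterwards; `old` and `printed` are names made up for this example.

```r
old <- options(width = 76)            # options() returns the previous settings

x <- 1:30
printed <- capture.output(print(x))   # the lines exactly as print() wraps them
cat(printed, sep = "\n")

options(old)                          # put the original width back
```

Because the wrap point depends on how wide the formatted values are, a vector with longer numbers would need a different width to keep 24 per row.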

Rename several columns (variable number)

I have a dataset where the last columns indicate the number of stops extracted from that dataset.
ColA ColB ColC 1 2 3 4 5 6 7 8 9 10 (...)
a g c a q e r e r q g h q (...)
What I want is to select every column from column 1 to the last one and prefix its name with Stop, ending up with Stop1, Stop2, etc.
The problem is that the number of those columns varies: sometimes they run from 1 to 10, other times only to 6.
I've tried dplyr and data.table, but I'm not sure how to automate this.
EDIT: ColA to ColC are fixed and always the same.
If I understood your problem correctly, the following code is flexible enough to solve it. Start with the following dataset:
set.seed(1)
df <- data.frame(matrix(rpois(130, 20),ncol=13))
names(df) <- c(paste("Col",LETTERS[1:3],sep=""),as.character(1:10))
df
#######
ColA ColB ColC 1 2 3 4 5 6 7 8 9 10
1 17 21 20 13 13 15 29 25 16 15 12 23 17
2 25 17 11 24 23 14 22 23 25 14 18 19 15
3 25 18 22 18 19 30 16 19 23 27 18 19 11
4 21 18 24 25 23 19 19 18 27 23 18 16 18
5 13 21 16 18 21 23 22 18 22 24 22 26 15
6 22 16 17 27 17 20 24 24 14 21 19 17 15
7 23 23 18 22 16 16 20 18 21 27 17 22 14
8 22 22 17 17 26 13 19 25 24 17 15 13 20
9 18 24 21 22 28 26 15 22 23 20 19 15 27
10 26 23 19 16 18 20 17 25 16 20 19 18 19
Now rename the columns as required:
k <- which(names(df)=="1")
names(df)[k:ncol(df)] <- paste("Stop",1:(ncol(df)-k+1),sep="")
df
#############
ColA ColB ColC Stop1 Stop2 Stop3 Stop4 Stop5 Stop6 Stop7 Stop8 Stop9 Stop10
1 17 21 20 13 13 15 29 25 16 15 12 23 17
2 25 17 11 24 23 14 22 23 25 14 18 19 15
3 25 18 22 18 19 30 16 19 23 27 18 19 11
4 21 18 24 25 23 19 19 18 27 23 18 16 18
5 13 21 16 18 21 23 22 18 22 24 22 26 15
6 22 16 17 27 17 20 24 24 14 21 19 17 15
7 23 23 18 22 16 16 20 18 21 27 17 22 14
8 22 22 17 17 26 13 19 25 24 17 15 13 20
9 18 24 21 22 28 26 15 22 23 20 19 15 27
10 26 23 19 16 18 20 17 25 16 20 19 18 19
I hope it can help you.
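A variation on the same idea, sketched under the assumption that the stop columns are exactly the ones whose names consist only of digits: match them by pattern instead of by position, so it does not matter where they start or how many there are. The small data frame is made up for illustration.

```r
# Hypothetical data: three fixed columns followed by a variable number of stops
df <- data.frame(ColA = 1:2, ColB = 3:4, ColC = 5:6,
                 `1` = 7:8, `2` = 9:10, `3` = 11:12,
                 check.names = FALSE)   # keep "1", "2", "3" as column names

# Columns whose names are purely numeric get the "Stop" prefix
is_stop <- grepl("^[0-9]+$", names(df))
names(df)[is_stop] <- paste0("Stop", names(df)[is_stop])

names(df)
```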

Splitting and iterative simple regression in r

I am pretty new to R, and a dummy example of a bigger table is shown below. I want to split the table by id (a, b, c, d) and run a simple linear regression on every subset.
x is my x variable and columns 1:6 are the y variables, so the output should cover each id and each of columns 1:6. It would also be great if I could output the p-values of the model slopes into a new data frame.
id x 1 2 3 4 5 6
1 a 74 18 19 NA 23 29 1
2 a 77 16 19 17 22 29 2
3 a 79 16 NA 19 23 29 3
4 a 81 17 20 18 23 29 4
5 b 74 19 20 19 23 28 11
6 b 76 15 19 18 26 28 12
7 b 79 19 21 20 24 28 NA
8 b 81 19 21 20 23 28 14
9 c 68 19 20 20 23 29 8
10 c 70 17 22 22 27 29 9
11 c 73 18 22 21 23 29 10
12 c 75 19 20 19 23 29 11
13 d 65 18 18 19 22 28 5
14 d 68 18 NA 18 20 29 6
15 d 70 18 19 18 23 28 7
16 d 72 19 17 19 22 28 8
I tried to use the plyr package, but it didn't work out:
regression = NULL
for ( i in 3:ncol(dumm)){
regression[i] <- dlply(dumm, .(id), function(z) lm(dumm[,i]~dumm$x, z))
}
coefs <- ldply(regression, coef)
Thanks in advance!
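One possible base R approach, offered as a sketch rather than a definitive answer: fit one lm() per id and per y column and collect the slope p-values in a data frame. The helper `slope_p` and the result name `pvals` are made up for this example. Column 5 is constant within some ids, so those fits are degenerate and their p-values should not be trusted.

```r
dumm <- read.table(text = "id x 1 2 3 4 5 6
1 a 74 18 19 NA 23 29 1
2 a 77 16 19 17 22 29 2
3 a 79 16 NA 19 23 29 3
4 a 81 17 20 18 23 29 4
5 b 74 19 20 19 23 28 11
6 b 76 15 19 18 26 28 12
7 b 79 19 21 20 24 28 NA
8 b 81 19 21 20 23 28 14
9 c 68 19 20 20 23 29 8
10 c 70 17 22 22 27 29 9
11 c 73 18 22 21 23 29 10
12 c 75 19 20 19 23 29 11
13 d 65 18 18 19 22 28 5
14 d 68 18 NA 18 20 29 6
15 d 70 18 19 18 23 28 7
16 d 72 19 17 19 22 28 8",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

y_cols <- setdiff(names(dumm), c("id", "x"))

# p-value of the slope for one subset and one y column (hypothetical helper)
slope_p <- function(sub, col) {
  d <- data.frame(x = sub$x, y = sub[[col]])
  fit <- lm(y ~ x, data = d)            # rows with NA in y are dropped by default
  # a constant y triggers a "perfect fit" warning; silence it here
  coefs <- suppressWarnings(summary(fit)$coefficients)
  if ("x" %in% rownames(coefs)) coefs["x", "Pr(>|t|)"] else NA_real_
}

# One row per (id, y column) combination
pvals <- expand.grid(id = unique(dumm$id), y = y_cols, stringsAsFactors = FALSE)
pvals$p_slope <- mapply(function(i, col) slope_p(dumm[dumm$id == i, ], col),
                        pvals$id, pvals$y, USE.NAMES = FALSE)
pvals
```

Building a small data frame with columns named x and y inside the helper avoids having to write the numeric column names into a formula.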
