Function to apply nested clustering to class "dist" objects in R

I have a data.frame called mydf.
x y
0 A
1 A
2 A
3 A
0 B
2 B
0 C
3 C
3 D
...(20,000 rows)
I am using the GMD package (elbow method) to automatically identify clusters and decide the number of clusters.
library("GMD")
dist.obj <- dist(mydf$x[mydf$y=="A"])
hclust.obj <- hclust(dist.obj)
css.obj <- css.hclust(dist.obj,hclust.obj)
elbow.obj <- elbow.batch(css.obj)
k <- elbow.obj$k
cutree.obj <- cutree(hclust.obj,k=k)
mydf$cluster <- cutree.obj
I would like to apply this script automatically to every category (A, B, C, D, etc.) in column y, without repeating it by hand for each one.
Problem 1: I get the error "Error: cannot allocate vector of size 2.1 Gb" when running this step:
css.obj <- css.hclust(dist.obj,hclust.obj)
Problem 2: I can run the following, but then I am stuck and can't get any further.
dist.obj <- lapply(split(mydf, mydf$y), dist)
The desired result for mydf is
x y cluster
0 A 1
1 A 1
2 A 2
3 A 3
0 B 1
2 B 2
0 C 1
3 C 2
3 D 1
Can you please help me? Any solution is well appreciated. Cheers!

Related

How can I automate a basic genetic distance matrix in R?

I'm trying to create an algorithm that would produce a distance matrix from a dataframe. The idea is that the dataframe will contain three or more aligned genetic sequences and the algorithm will calculate the number of differences between each sequence and convert this into a dataframe. Hence, the input data would look something like this:
taxon1 taxon2 taxon3
1 g g g
2 a c c
3 a a a
4 a t c
5 g g g
6 c t t
So far, I have the following code to calculate the difference between two sequences (taxon 1 and taxon 2):
distance1_2 <- 0
for (i in 1:length(taxon1)){
if (taxon1[i] == taxon2[i]){
distance1_2 <- distance1_2
}
else{
distance1_2 <- distance1_2 + 1
}
}
distance1_2
How can I automate this without manually repeating the same code for each individual taxon combination? The finished matrix should look something like this:
t1 t2 t3
t1 0 4 5
t2 4 0 5
t3 5 5 0
I am not sure whether the following is what you want:
outer(df, df, Vectorize(\(x,y) sum(x != y)))
#> taxon1 taxon2 taxon3
#> taxon1 0 3 3
#> taxon2 3 0 1
#> taxon3 3 1 0
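An alternative that avoids outer() is a pair of nested sapply() calls over the columns. This is only a sketch: it rebuilds the example data from the question as character columns, and the name mismatch_matrix is made up for illustration.

```r
# Example data from the question, stored as character columns
df <- data.frame(
  taxon1 = c("g", "a", "a", "a", "g", "c"),
  taxon2 = c("g", "c", "a", "t", "g", "t"),
  taxon3 = c("g", "c", "a", "c", "g", "t"),
  stringsAsFactors = FALSE
)

# For every pair of columns, count the positions where the bases differ
mismatch_matrix <- sapply(df, function(a) {
  sapply(df, function(b) sum(a != b))
})

mismatch_matrix
```

The result is a symmetric matrix with the taxon names as row and column names and zeros on the diagonal.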

How to compute total within sum of square in hierarchical clustering

I have read several textbooks and online tutorials about clustering algorithms. With the k-means algorithm, running kmeans() returns the total within-cluster sum of squares (TWSS). But when we run hclust() for agglomerative hierarchical clustering, this information is not available. Is it possible to compute the TWSS for hclust()? And is it reasonable to calculate the TWSS for hclust()?
The original data set is something like this:
1 -1.6768555093 -1.33937070 1.246858892 1.23171108 2.186761
2 -3.0832450282 1.28841533 0.286807651 1.54836547 3.494282
3 -1.4664760903 0.80289181 1.940444140 1.84226142 3.543522
4 -3.1109618863 0.32801815 -0.497680172 2.54236639 2.501975
5 -2.7603333486 0.49249130 1.041125723 1.75577604 2.868788
6 -4.3145154475 -2.01808802 1.227723818 0.09547962 2.570594
7 -1.6097707596 0.25391455 2.978627043 0.07428535 4.510882
Below is my code. Here, minClusters = 1 and maxClusters = 10.
hierarchy_mod <- hclust(Eucli_dis,method = "complete")
memb <- cutree(hierarchy_mod,minClusters:maxClusters)
memb_DT <- data.table(memb)
The result is a matrix, which I converted to a data.table:
1 2 3 4 5 6 7 8 9 10
1: 1 1 1 1 1 1 1 1 1 1
2: 1 1 1 1 1 1 1 1 2 2
3: 1 1 1 1 1 1 1 1 2 2
4: 1 1 1 1 1 1 1 1 1 1
5: 1 1 1 1 1 1 1 1 2 2
...
My problem now is that I don't know how to compute the TWSS in this scenario. I checked online tutorials and textbooks, but none of them calculates the TWSS for hclust()...
Thank you!
TWSS is useful for comparing different kmeans results because the starting configuration is usually random, so different runs can give different results. That does not happen in hierarchical clustering, since the clustering process is deterministic. But you can easily write R commands to compute it for any cluster result. First we need to make a reproducible example:
set.seed(4242)
x <- matrix(rnorm(125), 25, 5)
x.dist <- dist(x)
x.clus <- hclust(x.dist, method="complete")
plot(x.clus)
x.grps <- cutree(x.clus, 3:5)
We are clustering 25 rows (cases) by 5 columns (variables). We will look at solutions involving 3 to 5 clusters. We can use the scale() function to compute the sums of squares by cluster and then sum them:
x.SS <- aggregate(x, by=list(x.grps[, 1]), function(x) sum(scale(x,
scale=FALSE)^2))
x.SS
SS <- rowSums(x.SS[, -1]) # Sum of squares for each cluster
TSS <- sum(x.SS[, -1]) # Total (within) sum of squares
You will have to run this code for x.grps[, 1], x.grps[, 2], and x.grps[, 3]. Or make it into a function and use apply() to get them all:
TSS <- function(x, g) {
sum(aggregate(x, by=list(g), function(x) sum(scale(x,
scale=FALSE)^2))[, -1])
}
TSS.all <- apply(x.grps, 2, function(g) TSS(x, g))
TSS.all
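One way to sanity-check a function like this is to apply it to a kmeans() result, where R already reports the same quantity as tot.withinss. The sketch below assumes kmeans() has converged, so that its centers are the cluster means; the function name twss, the seed, and the choice of 3 clusters are only for illustration.

```r
set.seed(1)
x <- matrix(rnorm(125), 25, 5)

# Total within-cluster sum of squares for data x and group labels g:
# center each cluster on its own column means, square, and add everything up
twss <- function(x, g) {
  per_cluster <- sapply(split(as.data.frame(x), g), function(d) {
    sum(scale(d, scale = FALSE)^2)
  })
  sum(per_cluster)
}

# Check against the value that kmeans() reports itself
km <- kmeans(x, centers = 3)
all.equal(twss(x, km$cluster), km$tot.withinss)

# The same function applied to a hierarchical clustering cut at 3 clusters
hc_groups <- cutree(hclust(dist(x), method = "complete"), k = 3)
twss(x, hc_groups)
```

If the two k-means values agree, the function can be trusted on the cutree() labels as well, since it only needs the data and a vector of group memberships.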

Improve my coding "for loop"

The following is a simple loop to insert a new column in a data frame after checking a specific condition (if 2 consecutive rows have the same value).
The code works just fine but I would like to improve my coding skills so I ask for alternative solutions (faster, more elegant).
I checked previous threads on the topic and learned a lot but I am curious about my specific case.
Thanks for any input.
vector<-1
vector_tot<-NULL
for(i in 1:length(dat$Label1))
{
vector_tot<-c(vector_tot,vector)
if(dat$Label1[i]==dat$Label1[i+1]){
vector<-0
}
else {
vector<-1
}
}
dat$vector<- vector_tot
For many things in R, you do not need a for loop, since functions are vectorized. So we can achieve what you want with:
# sample data
dat <- data.frame(Label1=c("A","B","B","C","C","C","D"),stringsAsFactors = F)
# first create a vector that contains the previous value
dat$next_element <- c(dat$Label1[2:nrow(dat)],"")
# then check if they match
dat$vector <- as.numeric(dat$Label1==dat$next_element)
Output:
Label1 next_element vector
1 A B 0
2 B B 1
3 B C 0
4 C C 1
5 C C 1
6 C D 0
7 D 0
It can also be done in one line, but I think the above illustrates better how it works:
dat$vector <- dat$Label1==c(dat$Label1[2:nrow(dat)],"")
Or compare with the previous element:
dat$vector <- dat$Label1==c("",head(dat$Label1, -1))
You can do this in one line...
library(dplyr) #for the 'lead' function
dat = data.frame(Label1=c("A","B","B","C","C","C","D"),stringsAsFactors = F)
dat$vector <- as.numeric(dat$Label1!=lead(dat$Label1,default = ""))
dat
Label1 vector
1 A 1
2 B 0
3 B 1
4 C 0
5 C 0
6 C 1
7 D 1
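Note that the answers above compare each row with the next one. Reading the loop in the question, it seems to flag a row with 1 when it is the first row or when its label differs from the previous row. Assuming that reading is right, a vectorized sketch that gives the same flags could look like this (same sample data as above):

```r
dat <- data.frame(Label1 = c("A", "B", "B", "C", "C", "C", "D"),
                  stringsAsFactors = FALSE)
n <- nrow(dat)

# The first row is always flagged; every later row is flagged when it
# differs from the row directly before it
dat$vector <- as.numeric(c(TRUE, dat$Label1[-1] != dat$Label1[-n]))

dat$vector
# 1 1 0 1 0 0 1
```

Dropping the first element with [-1] and the last with [-n] lines up each value with its predecessor without any loop.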

How can I loop a data matrix in R?

I am trying to loop over a data matrix for each separate ID tag, “1”, “2” and “3” (see my data at the bottom). Ultimately I am doing this to transform the X and Y coordinates into a time series with the ts() function, but first I need to build a loop that returns a time series for each separate ID. The looping itself works perfectly fine when I use the following code for a dataframe:
for(i in 1:3){
print(na.omit(xyframe[ID==i,]))
}
Returning the following output:
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
Timestamp X Y ID
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
Timestamp X Y ID
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
Yet, when I want to produce a loop in a matrix with the same code:
for(i in 1:3){
print(na.omit(xymatrix[ID==i,]))
}
It returns the following error:
Error in print(na.omit(xymatrix[ID == i, ]) :
(subscript) logical subscript too long
Why does it not work to loop the ID through a matrix while it does work for the dataframe and how would I be able to fix it?
Furthermore, I read that looping requires much more computation than doing the same thing in a vectorized way. Would there be a way to do this with vectorized code?
The data (simplification of the real data):
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
The form xymatrix[ID==i,] doesn't work for a matrix. Index the ID column explicitly instead:
for(i in 1:3){ print(na.omit(xymatrix[xymatrix[,'ID'] == i,])) }
In general, if you want to apply a function to a data frame, split by some factor, then you should be using one of the apply family of functions in combination with split.
Here's some reproducible sample data.
n <- 20
some_data <- data.frame(
x = sample(c(1:5, NA), n, replace= TRUE),
y = sample(c(letters[1:5], NA), n, replace= TRUE),
grp = gl(3, 1, length = n)
)
If you want to print out the rows with no missing values, split by each ID level, then you want something like this.
lapply(split(some_data, some_data$grp), na.omit)
or more concisely using the plyr package.
library(plyr)
dlply(some_data, .(grp), na.omit)
Both methods return output like this
# $`1`
# x y grp
# 1 2 d 1
# 4 3 e 1
# 7 3 c 1
# 10 4 a 1
# 13 2 e 1
# 16 3 a 1
# 19 1 d 1
# $`2`
# x y grp
# 2 1 e 2
# 5 3 e 2
# 8 3 b 2
# $`3`
# x y grp
# 6 3 c 3
# 9 5 a 3
# 12 2 c 3
# 15 2 d 3
# 18 4 a 3
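The same split-then-apply idea also works directly on the matrix from the question. The sketch below rebuilds that matrix from the posted data and returns one multivariate time series per ID; the names rows_by_id and ts_by_id are made up, and ts() is called with its default start and frequency, which may not suit the real data.

```r
# Matrix built from the simplified data in the question
xymatrix <- cbind(
  Timestamp = rep(c(0, 100, 200), times = 3),
  X = c(-34.012, -33.995, -33.994, -34.093, -34.145, -34.195,
        -34.289, -34.300, -34.303),
  Y = c(3.406, 3.415, 3.427, 3.476, 3.492, 3.506, 3.522, 3.520, 3.517),
  ID = rep(1:3, each = 3)
)

# Split the row indices by ID, then build a time series from each block;
# drop = FALSE keeps a matrix even when an ID has only one row
rows_by_id <- split(seq_len(nrow(xymatrix)), xymatrix[, "ID"])
ts_by_id <- lapply(rows_by_id, function(i) {
  ts(xymatrix[i, c("X", "Y"), drop = FALSE])
})

ts_by_id[["2"]]
```

Splitting the row indices rather than the matrix itself avoids converting to a data frame, and the list elements are named after the ID values.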

Generating random number by length of blocks of data in R data frame

I am trying to simulate the measuring order n times and see how measuring order affects my study subject. To do this, I am trying to generate random integers in a new column of a data frame. I have a big data frame, and I would like to add a column that contains random numbers according to the number of observations in each block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
This is really easy with ddply from plyr.
library(plyr)
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
df$D <- sample(nrow(df))
df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D = 0
counts = table(df$A)
for (i in 1:length(counts)){
df$D[df$A == names(counts)[i]] = sample(counts[i])
}
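Since the question mentions repeating the simulation n times, the ave() approach can be wrapped in replicate(). This is a sketch only: the seed, the number of repetitions (5) and the names random_order, sims and is_perm are arbitrary choices for illustration.

```r
set.seed(42)
df <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3))

# One random measuring order: a permutation of 1..k within each block of size k
random_order <- function(block) {
  ave(block, block, FUN = function(x) sample(length(x)))
}

df$D <- random_order(df$A)

# Repeat the randomization 5 times: one column per simulated order
sims <- replicate(5, random_order(df$A))
dim(sims)

# Check that every block of D really is a permutation of 1..k
is_perm <- tapply(df$D, df$A, function(d) all(sort(d) == seq_along(d)))
is_perm
```

Each column of sims can then be attached to the data frame in turn to study the effect of that particular order.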

Resources