How to compute total within sum of squares in hierarchical clustering - r

I have read several textbooks and online tutorials about clustering algorithms. For k-means, running kmeans() returns the total within sum of squares (TWSS) as part of the output. But when we run hclust() for agglomerative hierarchical clustering, this information is not available. So is it possible to compute the TWSS for hclust()? And is it reasonable to do so?
The original data set is something like this:
1 -1.6768555093 -1.33937070 1.246858892 1.23171108 2.186761
2 -3.0832450282 1.28841533 0.286807651 1.54836547 3.494282
3 -1.4664760903 0.80289181 1.940444140 1.84226142 3.543522
4 -3.1109618863 0.32801815 -0.497680172 2.54236639 2.501975
5 -2.7603333486 0.49249130 1.041125723 1.75577604 2.868788
6 -4.3145154475 -2.01808802 1.227723818 0.09547962 2.570594
7 -1.6097707596 0.25391455 2.978627043 0.07428535 4.510882
Below is my code. Here, minClusters = 1 and maxClusters = 10:
hierarchy_mod <- hclust(Eucli_dis, method = "complete")
memb <- cutree(hierarchy_mod, minClusters:maxClusters)
memb_DT <- data.table(memb)
I got a matrix as the result and converted it to a data.table:
1 2 3 4 5 6 7 8 9 10
1: 1 1 1 1 1 1 1 1 1 1
2: 1 1 1 1 1 1 1 1 2 2
3: 1 1 1 1 1 1 1 1 2 2
4: 1 1 1 1 1 1 1 1 1 1
5: 1 1 1 1 1 1 1 1 2 2
...
The problem for me now is that I don't know how to compute the TWSS in this scenario. I checked online tutorials and textbooks, but no one computes the TWSS for hclust()...
Thank you!

TWSS is useful for comparing different kmeans results because the starting configuration is usually random, so different runs can give different results. That does not happen in hierarchical clustering, since the clustering process is deterministic. But you can easily write R commands to compute it for any clustering result. First we need a reproducible example:
set.seed(4242)
x <- matrix(rnorm(125), 25, 5)
x.dist <- dist(x)
x.clus <- hclust(x.dist, method="complete")
plot(x.clus)
x.grps <- cutree(x.clus, 3:5)
We are clustering 25 rows (cases) on 5 columns (variables) and will look at solutions with 3 to 5 clusters. Because scale(x, scale=FALSE) centers each column at its mean, we can use it to compute the sum of squares within each cluster and then add those up:
x.SS <- aggregate(x, by=list(x.grps[, 1]),
                  function(x) sum(scale(x, scale=FALSE)^2))
x.SS
SS <- rowSums(x.SS[, -1]) # Sum of squares for each cluster
TSS <- sum(x.SS[, -1]) # Total (within) sum of squares
You will have to run this code for x.grps[, 1], x.grps[, 2], and x.grps[, 3]. Or make it into a function and use apply() to get them all:
TSS <- function(x, g) {
  sum(aggregate(x, by=list(g),
                function(x) sum(scale(x, scale=FALSE)^2))[, -1])
}
TSS.all <- apply(x.grps, 2, function(g) TSS(x, g))
TSS.all
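To tie this back to the question, the same function gives the TWSS for each of the 1 to 10 cluster solutions, which is what the usual elbow plot needs. This is a sketch: `dat` stands for the original numeric data, which is only partially shown above. The last two lines are a sanity check on the reproducible example, since for any fixed grouping TSS() agrees with the tot.withinss that kmeans() reports:
memb <- cutree(hierarchy_mod, 1:10)   # minClusters:maxClusters
TWSS.all <- apply(memb, 2, function(g) TSS(dat, g))
plot(1:10, TWSS.all, type = "b",
     xlab = "Number of clusters", ylab = "Total within sum of squares")

km <- kmeans(x, centers = 3)                    # on the example data
all.equal(TSS(x, km$cluster), km$tot.withinss)  # TRUE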

Related

How can I automate a basic genetic distance matrix in R?

I'm trying to create an algorithm that produces a distance matrix from a dataframe. The idea is that the dataframe will contain three or more aligned genetic sequences, and the algorithm will calculate the number of differences between each pair of sequences and convert this into a dataframe. Hence, the input data would look something like this:
taxon1 taxon2 taxon3
1 g g g
2 a c c
3 a a a
4 a t c
5 g g g
6 c t t
So far, I have the following code to calculate the difference between two sequences (taxon 1 and taxon 2):
distance1_2 <- 0
for (i in 1:length(taxon1)) {
  if (taxon1[i] == taxon2[i]) {
    distance1_2 <- distance1_2
  } else {
    distance1_2 <- distance1_2 + 1
  }
}
distance1_2
How can I automate this without manually repeating the same code for each individual taxon combination? The finished matrix should look something like this:
t1 t2 t3
t1 0 4 5
t2 4 0 5
t3 5 5 0
I am not sure whether the following is what you want:
outer(df, df, Vectorize(\(x,y) sum(x != y)))
#> taxon1 taxon2 taxon3
#> taxon1 0 3 3
#> taxon2 3 0 1
#> taxon3 3 1 0
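For reference, the output above comes from the sample data in the question; a reproducible version (the \(x, y) lambda shorthand requires R 4.1 or later):
df <- data.frame(
  taxon1 = c("g", "a", "a", "a", "g", "c"),
  taxon2 = c("g", "c", "a", "t", "g", "t"),
  taxon3 = c("g", "c", "a", "c", "g", "t")
)
outer(df, df, Vectorize(\(x, y) sum(x != y)))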

How to plot only large communities/clusters in R

I have an igraph object g. Since the graph is huge, I only want to plot communities with more than 10 members, but I want to plot them all in one plot.
My idea to remove unwanted elements is:
g <- delete_vertices(g, V(g)[igraph::clusters(g)$csize < 10])
but for some reason this plots a lot of single nodes, which is the opposite of what I am trying to achieve. Can you tell me where I am going wrong?
Your idea is great, but the problem is that
igraph::clusters(g)$csize < 10
only returns a logical vector indexed by cluster, flagging the clusters that contain fewer than 10 members, while delete_vertices() needs to know which vertices belong to those clusters.
Hence, we may proceed as follows.
set.seed(1)
g1 <- erdos.renyi.game(100, 1 / 70)
cls <- clusters(g1)
cls$csize
# [1] 1 1 43 2 11 1 1 1 2 1 2 5 1 1 4 4 1 1 1 1 2 1 2 1
# [25] 4 1 1 1 1 1
# sizes 43 and 11 are the two clusters of interest (more than 10 members)
g2 <- delete_vertices(g1, V(g1)[cls$membership %in% which(cls$csize <= 10)])
plot(g2)
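Applied to the graph from the question, the corrected line would therefore be (a sketch, assuming g is your igraph object and you want to keep communities with more than 10 members):
cls <- igraph::clusters(g)
g <- delete_vertices(g, V(g)[cls$membership %in% which(cls$csize <= 10)])
plot(g)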

Unpaired but not paired t-test loop in R working

I have a loop that goes through a dataframe, runs t-tests, and stores the resulting p-value of each t-test in another dataframe.
Here is the loop, where 'mydata' is the dataframe the t-tests are run on. 'mydata' has 4 columns:
df <- mydata
# the new dataframe being initialized to store my p-values
mydf <- data.frame(c(1:4))
row.names(mydf) <- names(df)
for (i in names(df)) {
  if (sd(df[[i]]) == 0) {
    # this prevents the loop from terminating and returning an error
    # when t-tests are run on columns with binary values
  } else {
    # 'Pre' is the column that groups my data into distinct cohorts;
    # I am comparing the Pre cohort versus the Post cohort in these t-tests
    ttest <- t.test(df[df$Pre == 1, ][[i]], df[df$Pre == 2, ][[i]], paired = FALSE)
    mydf[i, 1] <- ttest$p.value
  }
}
mydf
Here is the output of mydf for an unpaired (paired=FALSE) t-test:
c.1.4.
density 0.3569670
clust 0.9715987
Pre 3.0000000
HC 4.0000000
However, when I change paired=FALSE to paired=TRUE (to run a paired t-test), here is mydf:
c.1.4.
density 1
clust 2
Pre 3
HC 4
I checked this line of my loop in isolation using the first column of my dataframe ([[1]]) with paired=TRUE, and it does appear to output a p-value:
ttest <- t.test(df[df$Pre==1,][[1]], df[df$Pre==2,][[1]], paired=TRUE)
ttest$p.value
[1] 0.356967
Below is a sample dataset that you can use to reproduce the error:
density clust Pre HC
RDHC008A_13 0.47991 0.676825 1 1
RDHC009A_13 0.49955 0.696441 1 1
RDHC010A_16 0.491454 0.706507 1 1
RDHC013A_13 0.442879 0.689118 1 1
RDHC014A_13 0.453823 0.691603 1 1
RDHC016A_16 0.481259 0.706978 1 1
RDHC019A_06 0.515442 0.699514 1 1
RDHC021A_15 0.449925 0.685202 1 1
RDHC022A_12 0.461319 0.705446 1 1
RDHC023A_11 0.468816 0.667698 1 1
RDHC024A_12 0.515142 0.719474 1 1
RDHC025A_13 0.496702 0.710877 1 1
RDHC026A_12 0.477061 0.695061 1 1
RDHC027A_12 0.515442 0.722269 1 1
RDHC029A_12 0.406747 0.669998 1 1
RDHC030A_12 0.476162 0.69219 1 1
RDHC032B_13 0.50075 0.685474 1 1
RDHC034B_07 0.525487 0.725558 1 1
RDHC036B_07 0.468816 0.698904 1 1
RDHC038B_07 0.470015 0.706668 1 1
RDHC039B_07 0.511544 0.712818 1 1
RDHC041A_14 0.551574 0.732983 1 1
RDHC004C_12 0.486207 0.695121 2 1
RDHC005C_12 0.505997 0.695598 2 1
RDHC006C_13 0.487406 0.697044 2 1
RDHC013C_12 0.41979 0.685518 2 1
RDHC015C_13 0.297751 0.69632 2 1
RDHC016C_16 0.463718 0.700011 2 1
RDHC019C_14 0.508096 0.690071 2 1
RDHC021C_12 0.448426 0.688265 2 1
RDHC022C_12 0.468816 0.700968 2 1
RDHC024C_12 0.515292 0.70664 2 1
RDHC025C_13 0.473163 0.704231 2 1
RDHC027C_12 0.518741 0.732939 2 1
RDHC030C_11 0.489205 0.708174 2 1
You can import it as follows: copy the data and paste it between the quotation marks of the code below:
zz <- ""
now, assign the data to a data.frame:
mydata <- read.table(text=zz, header=TRUE)
I have no idea why changing the 'paired' parameter to TRUE would cause this to happen. Any help/advice would be much appreciated. Thanks - Paul
You initialize the mydf data.frame with the values 1:4 here:
mydf <- data.frame(c(1:4))
and the loop then effectively leaves those values in place, because t.test() throws an error when you set paired=TRUE: your two sets of values aren't the same length, and they need to be for a paired t-test. You have 22 values where Pre==1 but only 13 where Pre==2, and you can't do a paired test with an imbalance like that.
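A quick way to confirm the imbalance, plus a sketch of how to keep the loop running anyway (the tryCatch guard is my addition, not part of the original loop):
table(mydata$Pre)
#  1  2
# 22 13    # unequal group sizes, so paired = TRUE cannot work

for (i in names(df)) {
  if (sd(df[[i]]) == 0) next  # skip constant/binary columns as before
  ttest <- tryCatch(
    t.test(df[df$Pre == 1, ][[i]], df[df$Pre == 2, ][[i]], paired = TRUE),
    error = function(e) NULL  # the paired test fails on unequal lengths
  )
  if (!is.null(ttest)) mydf[i, 1] <- ttest$p.value
}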

loop ordinal regression statistical analysis and save the data R

Could you please help me with a loop? I am relatively new to R.
The short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have code for an ordinal regression analysis of one trial from one participant (eye-tracking data, eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a huge data set consisting of 91832 observations of 11 variables. In total there are 41 participants with 78 trials each. In my code I extract the data of one trial from one participant to run the analysis. However, it takes a long time to run the analysis manually for all trials of all participants. Could you please help me create a loop that reads in all 78 trials from all 41 participants and saves the statistical output (I want to save summary(m), ci, and coef(m)) in one file?
Thank you in advance!
You could generate a unique identifier for every trial of every participant. Then you could loop over all unique values of this identifier and subset the data accordingly, run the regressions, and save the output as an R object:
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
  ss <- eyeData[eyeData$uniqueIdent == un, ]
  ss <- ss[!duplicated(ss$wordTar), ]  # maybe do this outside the loop
  ss$lFreq <- log(ss$Freq)             # you could do this outside the loop too
  # create DV
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF ~ lFreq * Len, data = ss, link = "probit")
  seeSumm <- summary(m)
  ci <- confint(m)
  oddsR <- exp(coef(m))
  # the file name contains -un- to identify where the output came from
  save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
}
Variations of this could include combining the output of every iteration in a list: create an empty list before the loop, then after running the estimation and post-estimation commands fill it recursively:
gatherRes <- vector(mode = "list", length = length(unique(eyeData$uniqueIdent)))  # before the loop
gatherRes[[un]] <- list(seeSumm, ci, oddsR)  # last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
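For example, a sketch of that lapply variant (run_trial is just an illustrative name; the body repeats the model code from the loop above):
library(ordinal)

run_trial <- function(ss) {
  ss <- ss[!duplicated(ss$wordTar), ]
  ss$lFreq <- log(ss$Freq)
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF ~ lFreq * Len, data = ss, link = "probit")
  list(summary = summary(m), ci = confint(m), oddsR = exp(coef(m)))
}

gatherRes <- lapply(split(eyeData, eyeData$uniqueIdent), run_trial)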
Here is a solution using the plyr package (it should be faster than a for loop).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x) {
  m = lm(Sepal.Width ~ Sepal.Length, data = x)
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, with your variables of interest as the grouping variables:
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
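The list elements are named after the grouping levels, so individual pieces can be pulled out directly:
results[["setosa"]]$coef     # coefficients for the setosa subset
results[["setosa"]]$confint  # confidence intervals for the same fit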
For your specific case, the function could look like (not tested):
ordFit_stats = function(x) {
  # Remove duplicates
  x = x[!duplicated(x$wordTar), ]
  # Make log frequencies
  x$lFreq <- log(x$Freq)
  # Make ranks
  x$rankF <- as.factor(seq(nrow(x)))
  # Fit model
  m <- clm(rankF ~ lFreq * Len, data = x, link = "probit")
  # Return list of statistics
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)
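Since the goal was to have everything in one file, the whole list can then be saved in a single step (the file name is just an example):
save(results, file = "ordFit_results.RData")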

Data simulation according to specific rules in R

I need help simulating a dataset.
It is supposed to contain all possible outcomes of a signal detection theory task (participants are presented with trials and have to decide whether or not they detected a given signal). I need a dataset of all possible values for a varying number of trials.
Say there are 10 trials, 5 with the signal present and 5 with the signal absent. I am only interested in correct detections (hits) and false alarms (Type I errors). A participant can correctly detect between 1 (I don't need 0's) and 5 signals, and can likewise make between 1 and 5 false alarms. With all possible combinations, that would be a dataset containing two variables with 5^2 cases. To make things more complicated, even the number of trials is variable. The number of both signal and non-signal trials can vary between 1 and 20, but the total number of trials cannot be less than 3 (either 1 S trial and 2 Non-S trials, or the other way around). And for each possible combination of trials, there is a set of possible combinations of hits and false alarms.
What I need is a dataset with 5 variables (total N, N of S trials, N of Non-S trials, N of Hits, and N of False Alarms) with all the possible values.
EXAMPLE
Here are all possible data for a total N of 4. Note that Signal + Noise = N_total, that N_Hit takes values in 1:Signal, and that N_FA takes values in 1:Noise.
N_total Signal Noise N_Hit N_FA
4 1 3 1 1
4 1 3 1 2
4 1 3 1 3
4 2 2 1 1
4 2 2 1 2
4 2 2 2 1
4 2 2 2 2
4 3 1 1 1
4 3 1 2 1
4 3 1 3 1
I'm an R novice so any help at all would be much appreciated!
Hope the description is clear.
I created a function that takes the number of trials as its parameter.
myfunc <- function(n) {
  # create a data frame of all combinations
  grid <- expand.grid(rep(list(seq_len(n - 1)), 4))
  # remove invalid combinations (keep valid ones)
  grid <- grid[grid[[3]] <= grid[[1]] &        # number of hits <= number of signals
               grid[[4]] <= grid[[2]] &        # false alarms <= noise
               (grid[[1]] + grid[[2]]) == n, ] # signal and noise sum to total n
  # remove rows with signal or noise > 20
  grid <- grid[!rowSums(grid[1:2] > 20), ]
  # sort rows
  grid <- grid[order(grid[[1]], grid[[3]], grid[[4]]), ]
  # add total number of trials
  res <- cbind(n, grid)
  # remove row names, add column names and return the object
  return(setNames("rownames<-"(res, NULL),
                  c("N_total", "Signal", "Noise", "N_Hit", "N_FA")))
}
Use the function:
> myfunc(4)
N_total Signal Noise N_Hit N_FA
1 4 1 3 1 1
2 4 1 3 1 2
3 4 1 3 1 3
4 4 2 2 1 1
5 4 2 2 1 2
6 4 2 2 2 1
7 4 2 2 2 2
8 4 3 1 1 1
9 4 3 1 2 1
10 4 3 1 3 1
How to apply this function to the values 3-40:
lapply(3:40, myfunc)
This will return a list of data frames.
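If a single data frame is more convenient than a list, the pieces can simply be stacked, since every element has the same five columns:
all_combinations <- do.call(rbind, lapply(3:40, myfunc))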
