Looking for analysis that clusters like SIMPROF, but allows for many observations per category - r

I need to run a clustering or similarity analysis on some biological data, and I am looking for an output like the one SIMPROF gives, i.e. a dendrogram or hierarchical clustering.
However, I have about 3200 observations/rows per group. SIMPROF (see the example below)
library(clustsig)
usarrests <- USArrests[, c(1, 2, 4)]
rownames(usarrests) <- state.abb
# Run simprof on the data
res <- simprof(data = usarrests,
               method.distance = "braycurtis")
# Graph the result
pl.color <- simprof.plot(res)
seems to expect only one observation per group (US state in this example).
Now, again, my biological data (140k rows total) has about 3200 observations per group.
I am trying to cluster together the groups that have a similar representation across the variables provided.
It is as if, in the example above, AK were represented by more than one observation.
What's my best bet for a function/package/analysis?
Cheers,
Mo
Example from a paper: [figure omitted]

The solution became obvious upon further reflection.
Instead of using all ~200k observations in long format, I combined longitude and depth of sampling into one variable and used those combinations like sampling units along a transect. I thus ended up with about 3800 columns of longitude-depth combinations and 61 rows for the taxa, with the value variable being the abundance of each taxon (if you want to cluster the sampling units instead, you have to transpose the data frame). This is feasible for hclust or SIMPROF, since the quadratic complexity now only applies to 61 rows (as opposed to the ~200k I tried at the beginning).
Cheers
Here is some code:
library(reshape2)
library(dplyr)
d4 <- d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))
# make one variable of longitude and depth that can be used for all taxa measured,
# like community ecology sampling units
d4$sampling_units <- paste(d4$LONGITUDE_DEC, d4$BIN_MIDDEPTH_M)
d5 <- d4 %>% select(PREDICTED_GROUP, CONCENTRATION_IND_M3, sampling_units)
d5 <- d5 %>% na.omit()
# dcast the data frame so that you get the taxa as rows, sampling units as columns,
# with concentration/abundance as values
d6 <- dcast(d5, PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7 <- d6 %>% na.omit()
d7$PREDICTED_GROUP <- as.factor(d7$PREDICTED_GROUP)
# give the rownames the taxa names
rownames(d7) <- paste(d7$PREDICTED_GROUP)
# delete the variable that is no longer needed
d7$PREDICTED_GROUP <- NULL
library(vegan)
# calculate the dissimilarity matrix with vegdist so you can use the
# Sorensen/Bray-Curtis method
distBray <- vegdist(d7, method = "bray")
# calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
clust1
#plot the cluster dendrogram with dendextend
library(dendextend)
library(ggdendro)
library(ggplot2)
dend <- clust1 %>% as.dendrogram() %>%
  set("branches_k_color", k = 5) %>%
  set("branches_lwd", 0.5) %>%
  set("clear_leaves") %>%
  set("labels_colors", k = 5) %>%
  set("leaves_cex", 0.5) %>%
  set("labels_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
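If you specifically want the SIMPROF significance test rather than a plain hclust, the same taxa-by-sampling-unit matrix can in principle be passed to clustsig::simprof as well. A minimal sketch, assuming d7 from above (untested on the real data, and the permutation test can still take a while):
library(clustsig)
# SIMPROF on the 61 taxa rows of d7, using Bray-Curtis as above
res_simprof <- simprof(data = d7,
                       method.distance = "braycurtis",
                       method.cluster = "ward.D2")
simprof.plot(res_simprof)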

Related

GAM distributed lag model with factor smooth interaction (by variable)

I'm trying to compare the climate response over the last 60 years of two subgroups of a plant (factor variable subgroups with 2 levels). The response of the two subgroups, which both grew on the same plots, is measured as the deviation from long-term growth (plant_growth). As climate data, mean temperature (tmean) and mean precipitation (prec) are available.
I formulated a distributed lag model using mgcv's gam() to test the hypothesis, that the climate response differs between the plant subgroups:
climate_model <- gam(plant_growth ~ te(tmean, lag, by = subgroups) +
                       te(prec, lag, by = subgroups) +
                       te(tmean, prec, lag, by = subgroups),
                     data = plant_data)
plant_data is a list that contains tmean, prec and lag as separate numeric matrices, subgroups as a factor variable distinguishing between subgroups A and B, a character variable giving the ID of each plant, and the measured plant_growth as a numeric vector.
The problem is, however, that factor by variables cannot be used with the matrix arguments from plant_data. The error message looks as follows:
Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, :
factor `by' variables can not be used with matrix arguments.
I'm wondering if there is a way to include the factor variable subgroups into the distributed lag model so that a comparison between the two levels of the factor is possible.
I've already tried running two separate lag models for the two levels of subgroups. This works fine. However, I cannot really compare the predictions of the two models because the fit and the parameters of the smooths are different. Moreover, in this way the climate response of the two subgroups is treated as if it were completely independent, which is not the case.
I was able to reproduce my problem with growth data from the treeclim package:
library("treeclim") #Data library
data("muc_spruce") #Plant growth
data("muc_clim") #Climate data
library(dplyr)
library(tidyr)
#Format climate to wide
clim <- pivot_wider(muc_clim, names_from = month, values_from = c(temp, prec))
#Format the growth data and add three new growth time series
growth <- muc_spruce %>%
  select(-samp.depth) %>%
  mutate(year = as.numeric(row.names(muc_spruce))) %>%
  mutate(ID = 1) %>%
  rename("plant_growth" = "mucstd")
additional_growth <- data.frame()
for (i in c(1:3)){
  A <- growth %>%
    mutate(plant_growth = plant_growth + runif(nrow(muc_spruce), min = 0, max = 0.5)) %>%
    mutate(ID = ID + i)
  additional_growth <- rbind(additional_growth, A)
}
growth <- rbind(growth, additional_growth)
#Bring growth and climate data together
plant_data <- na.omit(left_join(growth, clim))
rm(A, growth, clim, muc_clim, muc_spruce, additional_growth, i) #clean
#Add the subgroups label
plant_data$subgroups <- as.factor(c(rep("A", nrow(plant_data)/2), rep("B", nrow(plant_data)/2)))
#Format for gam input
plant_data <- list(lag = matrix(1:12, nrow(plant_data), 12, byrow = TRUE),
                   year = plant_data$year,
                   ID = plant_data$ID,
                   plant_growth = plant_data$plant_growth,
                   subgroups = as.factor(plant_data$subgroups),
                   tmean = data.matrix(plant_data[, c(4:15)]),
                   prec = data.matrix(plant_data[, c(16:27)]))
From ?mgcv::linear.functional.terms:
The mechanism is usable with random effect smooths which take factor arguments, by using a trick to create a 2D array of factors. Simply create a factor vector containing the columns of the factor matrix stacked end to end (column major order). Then reset the dimensions of this vector to create the appropriate 2D array: the first dimension should be the number of response data and the second the number of columns of the required factor matrix. You can not use matrix or data.matrix to set up the required matrix of factor levels. See example below:
## set up a `factor matrix'...
fac <- factor(sample(letters,n*2,replace=TRUE))
dim(fac) <- c(n,2)
You cannot create a factor matrix directly, though, but you can create a factor vector and modify its dims afterwards.
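Applied to the example above, that trick would look roughly like the sketch below (my illustration, not tested on this data; note the documentation describes it for random-effect smooths that take factor arguments, so whether it resolves the factor by issue in this distributed lag setup is not guaranteed):
# build a "factor matrix" for subgroups: stack the per-observation labels
# once per lag column (column-major order), then set the dims
n <- length(plant_data$plant_growth)
fac <- rep(plant_data$subgroups, times = 12)
dim(fac) <- c(n, 12)
plant_data$subgroups_mat <- fac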

vegan::betadisper() extract distance and error associated with centroid

I am trying to construct a meta-regression to look at the distance between centroids across multiple independent monitoring datasets. To build that model, for each dataset I need to extract the distance to each centroid (each dataset has the same grouping variable with two levels -- before, after), the number of points that went into calculating the centroid (n), and the standard deviation associated with each distance to centroid (sd). I'm using vegan::betadisper() to calculate the distance to each centroid, but I am not sure whether it is possible to extract a single standard deviation associated with each centroid.
I've modified the dune dataset below as sample code. The 'Use' grouping variable has two levels: before, after.
rm(list = ls())
library(vegan)
library(dplyr)
# Species and environmental data
dune2.spe <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune2.env <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data(dune)     # matrix with species data (20 samples in rows and 30 species in columns)
data(dune.env) # matrix of environmental variables (20 samples in rows and 5 environmental variables in columns)
#select two grouping levels for 'Use'
dune_data <- cbind(dune2.spe, dune2.env) %>%
  filter(Use == 'Pasture' | Use == 'Hayfield')
dune_data$Use <- recode_factor(dune_data$Use, 'Pasture' = 'Before')
dune_data$Use <- recode_factor(dune_data$Use, 'Hayfield' = 'After')
dune_sp <- dune_data %>%
  dplyr::select(1:28)
dune_en <- dune_data %>%
  dplyr::select(29:33)
#transform relative species counts
dune_rel <- decostand(dune_sp, method = "hellinger")
dune_distmat <- vegdist(dune_rel, method = "bray", na.rm = TRUE)
(dune_disper <- betadisper(dune_distmat, type = "centroid", group = dune_en$Use))
plot(dune_disper, label = FALSE)
I am trying to arrive at the following output:
Group   before_distance   n_before   sd_before   after_distance   n_after   sd_after
Dune    0.4009            5          ?           0.4314           7         ?
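One way to get there (my sketch, not from the original post): betadisper() stores every sample's distance to its group centroid in the $distances component, alongside $group, so n, the mean distance, and a standard deviation can be summarised per group. Continuing from dune_disper above:
# summarise distance-to-centroid per group from the betadisper object
dists <- data.frame(group = dune_disper$group,
                    dist  = dune_disper$distances)
dists %>%
  group_by(group) %>%
  summarise(n         = n(),
            mean_dist = mean(dist),
            sd_dist   = sd(dist))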

Clustering using daisy and pam in R

I'm trying to perform a pretty straightforward clustering analysis but can't get the results right. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness 2) nausea / abd pain. However, I can't get the code right. I'm using the pam and daisy functions. For this example I manually assign 2 clusters (k=2) because I know the desired result, but in reality I explore several values for k.
Does anyone know what I'm doing wrong here?
library(cluster)
library(dplyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))
gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k) # performs cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)
The format in which you pass the dataset to the clustering algorithm is not right for your objective. If you want to group diseases that are reported together but you also include the IDs in your dissimilarity matrix, they will contribute to the matrix construction, and you do not want that, since your objective concerns only the diseases.
Hence, we need to build a dataset in which each row is a patient with all the diseases he/she reported, and then construct the dissimilarity matrix only on the numeric features. For this task, I'm going to add a column presence with value 1 if the disease is reported by the patient and 0 otherwise; the zeros will be filled in automatically by the function pivot_wider.
Here is the code I used; I think it achieves what you wanted, but please tell me if that is so.
library(cluster)
library(dplyr)
library(tidyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
                  presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
  dat,
  id_cols = ID,
  names_from = PTName,
  values_from = presence,
  values_fill = list(presence = 0)
)
# in the dissimilarity matrix construction, we leave out the ID column
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2
set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- dat_wider %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)
Furthermore, since you are working only with binary data, instead of Gower's distance you can consider using the Simple Matching or Jaccard distance if they suit your data better. In R you can employ them using
p <- ncol(dat_wider %>% select(-ID)) # number of binary variables considered
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan") / p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")
respectively, where p is the number of binary variables you want to consider.
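For completeness, either of these dissimilarity matrices can be dropped into the same PAM call used above; a minimal sketch (keeping k = 2 from the example, which is not a recommendation):
# re-run PAM on the Jaccard dissimilarities instead of Gower's distance
pam_fit_jaccard <- pam(j_dist, diss = TRUE, k = 2)
table(pam_fit_jaccard$clustering)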

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items, I counted the number of times the pair occurred within the same order; the highest count between two items was 489. I then "calculated" the similarity/correlation as frequency / "max frequency of all pairs" (489). The result is the dataset I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried something called "Jaccard's coefficient/index", but got almost the same results.
The dataset: The dataset contains material numbers V1 and V2, and N is the correlation between the two material numbers, ranging from 0 to 1.
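(For illustration only, not from the original post: the pre-processing described above could be done roughly like this, assuming a long orders table with hypothetical columns order_id and item.)
library(data.table)
# toy orders table; the real data has 696 unique items
orders <- data.table(order_id = c(1, 1, 2, 2, 2, 3),
                     item     = c("A", "B", "A", "C", "B", "A"))
# all item pairs within each order (single-item orders contribute none)
pairs <- orders[, if (.N > 1) as.data.table(t(combn(sort(item), 2))), by = order_id]
# count how often each pair co-occurs, then scale by the maximum count
co_occ <- pairs[, .(N = .N), by = .(V1, V2)]
co_occ[, N := N / max(N)]
co_occ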
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested this: more than 95% of the pairs carry no information, which puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be mitigated with the PAM algorithm, but you will still have a very concentrated group. Another option is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros, but I'm not sure.
Problem 2: Clustering does not do very well; I get very concentrated clusters. A lot of items end up in the first cluster, and by the usual ways of validating PAM clusters my results are poor. Is it due to the similarity analysis? What else should I use? Is it due to 95% of the data being zeros? Should I change the zeros to something else?
The whole code and results:
#Suppose X is the dataset
library(data.table)
library(cluster)
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]
ss <- ss / max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
I'm not sure where you get the 696 number from. After you rbind, you have a data frame with 567 unique values for V1 and V2; you then perform the dcast and end up, as expected, with a 567 x 567 matrix. Clustering-wise I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, then from an academic viewpoint your current justification for why PAM is required does not hold much merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see the logic of why PAM is applicable here (as I mentioned in the comments, PAM works best for mixed variables).
Continuing further, see this post on correlation detection in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor <- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
[1] 32 4
Hope this helps.

Clustering a mixed data set in R

I have a mixed data set (has factor and numeric variable types) and I want to do some clustering analysis. This is so that I will be able to study the entries in each cluster to tell what they have in common.
I know that for this type of data set, the distance to use is "Gower distance".
This is what I have done so far:
cluster <- daisy(mydata, metric = c("euclidean", "manhattan", "gower"),
                 stand = FALSE, type = list())
try <- agnes(cluster)
plot(try, hang = -1)
The above gave me a dendrogram but I have 2000 entries in my data and I am unable to identify the individual entries at the end of the dendrogram. Also, I want to be able to extract the clusters from the dendrogram.
There should be only one metric in the daisy function. The daisy function provides a distance matrix of (mixed-type) observations.
To obtain the cluster labels from agnes, one can use the cutree function. See the following example using the mtcars data set:
Preparing the data
The mtcars data frame has all variables on a numerical scale. However, when one looks at the description of the variables, it is apparent that some of them cannot be used as numeric variables when clustering the data.
For example, vs, the shape of the engine, should be an (unordered) factor variable, while the number of gears should be an ordered factor.
# directly from ?mtcars
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})
Compute the dissimilarity matrix
# Compute all the pairwise dissimilarities (distances) between observations
# in the data set.
library(cluster)
diss_mat <- daisy(mtcars2, metric = "gower")
Clustering the dissimilarity matrix
# Computes agglomerative hierarchical clustering of the dataset.
k <- 3
agnes_clust <- agnes(x = diss_mat)
ag_clust <- cutree(agnes_clust, k)
# Clustering the dissimilarity matrix using
# partitioning around medoids
pam_clust <- pam(diss_mat, k)
# A comparison of the two clusterings
table(ag_clust, pam_clust=pam_clust$clustering)
#          pam_clust
# ag_clust  1  2  3
#        1  6  0  0
#        2  2 10  2
#        3  0  0 12
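To also address the part of the question about extracting and inspecting the entries in each cluster, the cutree labels can simply be split by cluster; a short sketch continuing the example above:
# list the row names (cars) belonging to each agnes cluster
split(rownames(mtcars2), ag_clust)
# or attach the labels back to the data for further inspection
mtcars2$cluster <- ag_clust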
Other packages
A couple of other packages for clustering mixed-type data are CluMix and FD.
