How to calculate the quality of clustering by dtw? - r

my aim is to cluster 126 time-series concerning 26 weeks (so each time-series has 26 observation). I used pam{cluster} = partitioning around medoids to cluster these time-series.
Before clustering I wanted to compare which distance measure is the most appropriate: euclidean, manhattan or dynamic time warping. I used each distance to cluster and compare by silhouette plot. Is there any way I can compare different distance measure?
For example I know that procedure clValid {clValid} to validate cluster results, however I cannot implement dtw to calculate indexes.
So how can I compare different distance metrics (not only by silhouette)?
Additional question: is GAP statistic enough to decide how many clusters choose? Or should I evaluate number of clusters with different methods or compare two or three ways how to do it?
I would be grateful for any suggestions.

I have just read the book "cluster analysis, fifth edition" by Brian S. Everitt, etc. And currently, I adopt the following strategy to select method to calculate distance matrix, clustering and validation:
for distance: using cmdscale{stats} function to calculate multidimentional scaling, and plot the scatterplot of the two scaling dimensions with density information. As expected, if there is distinct clusters or nested clusters, the scatterplot will give some hints.
for clustering: for every clustering method, calculate cophenetic correlation between clustering results and the distance, this can be calculated using cophenetic{stats} function. The best clustering method will give higher correlation. However, this is only working for hierarchical clustering. I haven't idea for other clustering methods, like pam, or kmeans.
for partition evaluation: package {clusterSim} give several function to calculate the index to evaluate the clustering quality. Another package {NbClust} also calculate so many as 30 index to evaluate the combination of "distance", "clustering" and "number of clusters". However, this package partition the hierarchical tree using {cutree}, which is not suitable for nested clustering structure. Another method provided by {dynamicTreeCut} give reasonable results.
for cluster number determination: will added later.

Cluster data for which you have class labels, and use the RAND index to measure cluster quality.
50 such datasets are at the UCR time series archive
This paper does something similar
http://www.cs.ucr.edu/~eamonn/ClusteringTimeSeriesUsingUnsupervised-Shapelets.pdf

Related

Weighted Cluster Analysis in R — generating more clusters than requested with hclust

I'm trying to conduct a hierarchical agglomerative cluster analysis in R by using the Weighted Cluster package. Before doing so, I calculated the distances between state sequences by leveraging the TraMineR package (see pp. 4-6 here).
Following the vignette hyperlinked above, I fed my distance matrix into hclust while adding a vector of weights as follows (datadist is the distance matrix; dataframe is my data frame featuring time series data; and weight is an all-waves longitudinal survey weight):
Cluster <- hclust(as.dist(datadist), method = "ward", members = dataframe$weight)
Then, after arriving at a specific cluster solution (four subgroups), I used the cutree function to determine the relative frequency of each cluster and assign cases:
subgroups <- cutree(Cluster, k = 4)
However, I somehow generated more than four groups after executing the code above (over 30, in fact). When I removed the vector of weights, I was able to produce frequencies for four clusters, but unweighted results are sub-optimal.
If anyone out there can help me understand what's going on (and how I can address or treat the problem), it would be greatly appreciated.

Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, i do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix to merge those observations which were identical to reduce my matrix to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data is a mixture of numerical and categorical variables for the 500K observations so i have used the daisy package to calculate the gower dissimilarity for my aggregated dataset. I want to use hclust in the stats package for the aggregated dataset however i want to take into account the frequency of each observation. From the help information for hclust the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is:, NULL or a vector with length size of d. See the ‘Details’ section. When you look at the details section you get: If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
From the above description, i am unsure if i can assign my frequency weights to the members arguments as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For ward etc. you need to correctly take the weights into account yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance of two clusters, not two points. So the matrix should contain the distance of n1 points at location x and n2 points at location y. For min/max/mean this n disappears or cancels out. For ward, you should get a SSQ like formula.

Hierarchical clustering on continuous heterogeneous variables with different range/scales in R

I would like to use R to perform hierarchical clustering with two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that have been normalized and batch effect corrected. The other group also has some quantitative clinical parameters that describe the same samples. However, these clinical variables have not been normalized or subjected to any kind of transformation(i.e. raw continuous values).
For example, one variable of these could have range of values from 2 to 35, whereas another from 0.1 to 0.9, etc.
Thus, as my ultimate goal in to implement hierarchical clustering and use both groups simultaneously (merged in a matrix/dataframe), in order to inspect which of these clinical variables cluster with specific genes, etc:
1) Is an initial transformation in the group of the clinical variables necessary before merging with the genes and perform the clustering ? For example: log2 transformation, which has also been done to part of my gene expression data !!
2) Or, a row scaling (that is the total features in the input data) would take into account this discrepancy ?
3) For a similar analysis/approach, like constructing a correlation plot of the above total variables, would a simple scaling be sufficient?
Without having seen your gene expression data, I can only provide you some general suggestions based on your description, in the context of the 3 questions you asked:
1) You should definitely check the distribution of each group. In R, you may use one or more of the following function to visualize the distribution:
hist(expression_data) ##histogram
plot(density(expression_data)) ##density plot; alternative to histogram
qqnorm(expression_data); qqline(expression_data) #QQ plot
Since my understanding is that one of your expression data group is log2 transformed, that particular group should have a normal distribution (i.e. a bell curve shape in the histogram and a straight line in the QQ plot). Whether to transform the group that has not yet been transformed will depend on what you want to do with the data. For instance, if you want to use a t-test to compare the two groups, then you definitely need a transformation, as there is a normality assumption associated with a t-test. With regard to hierarchical clustering, if you decide to use both groups in a single clustering analysis, then why would you ever keep one transformed and the other not?
2) Scaling by features is a reasonable approach. Here is a clustering lecture from a Utah State Univ. stats course, with an example. scale=TRUE is an option for you if you decide to use heatmap function in R.
3) I don't think there is a definitive answer to your third question. It has to depend on how many available features you have and what analyses you will be doing downstream. Similar to question 1, I would argue that simple scaling may be sufficient for visualizing your data by hierarchical clustering. However, do keep in mind that, say you decide to perform a linear model (which is very common with microarray gene expression data), you might want to consider more sophisticated data scaling.

Timeseries cluster validation: using cluster.stats metrics to decide optimal cluster number

I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.
My approach is the following:
1. I cluster several times using different numbers of clusters and calculate the cluster statistics for each of these options
2. I calculate the cluster statistic metrics using FPC's cluster.stats R package: Cluster.Stats from FPC Cran Package. I plot these and decide for each metric which is the best cluster number (see my code below).
My problem is that these metrics each evaluate a different aspect of the clustering "goodness", and the best number of clusters for one metric may not coincide with the best number of clusters of a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-sum of squares may indicate that 75 clusters is a better choice.
I understand the basics: that distances between points within a cluster should be small, that clusters should have a good separation from each other, that the sum of squares should be minimized, that observations which are in different clusters should have a large dissimilarity / different clusters should ideally have a strong dissimilarity. However, I do not know which of these metrics is most important to consider in evaluating cluster quality.
How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?
Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?
Here is how I am deciding the best number of clusters using the statistics:
cs_metrics is my dataframe which contains the statistics.
Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]
Here is the result:
Here are the plots that compare the cluster statistics for the different numbers of clusters:

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181408 rows that I am going to be clustering using k-means. What I would ideally like is a method of identifying what the optimum number of clusters should be. I have tried implementing the gap statistic technique using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limitproblems in Windows and a "Error in dist(xs) : negative length vectors are not allowed" in OS X. Does anyone has any suggestions on techniques that will work in determining optimum number of clusters with a large dataset? Or, alternatively, how to make my code function (and does not take several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the numbers of the clusters k to provide as parameter to k-means so there are three ways to find it automaticaly:
G-means algortithm: it discovers the number of clusters automatically using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detect the number of clusters, based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution (continuous function which approximates the exact binomial distribution of events), and if not it splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then the algorithm splits it into two centers (k=2) and splits each of these two centers again (k=4), having four centers in total. If G-means does not accept these four centers then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimation of the number of clusters you will get after grouping your instances. Notice that an inconvenient choice for the "k" parameter might give you wrong results. The parallel version of g-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: a new algorithm that efficiently, searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
Online k-means or Streaming k-means: it permits to execute k-means by scanning the whole data once and it finds automaticaly the optimal number of k. Spark implements it.
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
wssplot(df)
this will create a plot like this.
From this you can choose the value of k to be either 3 or 4. i.e
there is a clear fall in 'within groups sum of squares' when moving from 1 to 3 clusters. After three clusters, this decrease drops off, suggesting that a 3-cluster solution may be a good fit to the data.
But like Anony-Mouse pointed out, the curse of dimensionality affects due to the fact that euclidean distance being used in k means.
I hope this answer helps you to a certain extent.

Resources