Which dataset do I use for calculating the Calinski-Harabasz index?

I am doing a cluster analysis on the most significant components.
In order to find the number of clusters I apply the Calinski-Harabasz index. I have two questions:
Do I need to normalize the components before clustering? I haven't done it so far, since the variance expresses the importance of a component.
Concerning the CH index, do I calculate it on the original data or on the output of my pca function? I'll try to clarify:
pca <- prcomp(data_scaled)
pca$x
Here I use pca$x for the cluster analysis. Should I use the data_scaled dataset or the pca$x dataset for calculating the CH index?
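For concreteness, a minimal sketch of how the CH index could be computed with calinhara from the fpc package, using the PCA scores and an illustrative k-means clustering (whether pca$x or data_scaled is the right input is exactly what is being asked):
library(fpc)                            # assumed here for calinhara()
pca    <- prcomp(data_scaled)
scores <- pca$x                         # PCA scores used for the clustering
km     <- kmeans(scores, centers = 3)   # illustrative clustering with 3 clusters
calinhara(scores, km$cluster)           # CH index computed on the clustered data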

Related

How are the observed cell counts obtained in svychisq in R?

I'm using the function survey::svychisq() to test for independence in a two-way contingency table for complex samples.
With svytable() I get the observed counts considering the weights defined in design, and I would assume that the observed values saved in svychisq() objects would be the same, but they are not:
svytable(~row.var+col.var, design)
# 330.6634 867.6478 177.1630
# 687.4503 962.5404 228.2926
and
svychisq(~row.var+col.var, design)$observed
# 404.6712 1061.8411 216.8149
# 841.3126 1177.9722 279.3881
provide different results and I couldn't really understand why.
Could someone explain to me how the latter observed values are calculated?
Thanks!
The $observed component is a weighted table with weights scaled to sum to the sample size.
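A minimal sketch of that rescaling, assuming design is an ordinary svydesign object containing row.var and col.var:
library(survey)
tab <- svytable(~ row.var + col.var, design)   # weighted cell counts
n   <- nrow(design$variables)                  # number of sampled observations
tab * n / sum(tab)                             # should match svychisq(~ row.var + col.var, design)$observed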

Clustering as a dimension reduction technique, and how to pick representative elements for each cluster?

I have a dataset in which some observations are highly correlated. I am doing a clustering analysis on the distance matrix obtained from the correlation matrix. Some elements in this dataset are redundant, and I want to select representative elements with minimal mutual correlation. A brute-force method would be to simply choose one element from each cluster, but I would like to know whether there are more formal methods for this kind of dimensionality reduction in R.
For instance, we are doing the clustering on the mtcars dataset in the following manner:
> m=cor(t(mtcars))
> hc=hclust(as.dist(m),"ave")
> plot(hc)
We are obtaining the following dendrogram:
How can I extract the essential elements from the above dendrogram, i.e. the elements which are minimally mutually correlated?
One option would be to use some of the pre-processing functions within the caret package.
Using your example, the code below will remove all columns whose correlation with another column exceeds 0.95.
library(caret)
m <- cor(t(mtcars))                             # correlations between observations (cars)
highlyCor <- findCorrelation(m, cutoff = .95)   # indices of highly correlated columns to drop
t(mtcars)[, -highlyCor]                         # keep only the weakly correlated cars
The above code is adapted from Max Kuhn's excellent book. Refer to it and the caret documentation for more background and information.

How to calculate the quality of clustering by dtw?

My aim is to cluster 126 time series covering 26 weeks (so each time series has 26 observations). I used pam from the cluster package (partitioning around medoids) to cluster these time series.
Before clustering I wanted to compare which distance measure is the most appropriate: Euclidean, Manhattan, or dynamic time warping. I used each distance to cluster and compared the results by silhouette plot. Is there any other way I can compare the different distance measures?
For example, I know the clValid procedure from the clValid package validates clustering results; however, I cannot plug a DTW distance into it to calculate the indices.
So how can I compare different distance metrics (not only by silhouette)?
An additional question: is the GAP statistic enough to decide how many clusters to choose? Or should I evaluate the number of clusters with different methods, or compare two or three ways of doing it?
I would be grateful for any suggestions.
I have just read the book "Cluster Analysis, Fifth Edition" by Brian S. Everitt et al., and I currently adopt the following strategy to select a method for calculating the distance matrix, clustering, and validation:
for distance: use the cmdscale{stats} function to calculate a multidimensional scaling, and plot a scatterplot of the two scaling dimensions with density information. If there are distinct clusters or nested clusters, the scatterplot will give some hints.
for clustering: for every clustering method, calculate the cophenetic correlation between the clustering result and the distance matrix; this can be done with the cophenetic{stats} function (see the sketch after this list). The best clustering method will give the highest correlation. However, this only works for hierarchical clustering; I have no idea for other clustering methods, such as pam or kmeans.
for partition evaluation: the {clusterSim} package provides several functions to calculate indices that evaluate clustering quality. Another package, {NbClust}, calculates as many as 30 indices to evaluate combinations of distance, clustering method, and number of clusters. However, this package partitions the hierarchical tree using cutree, which is not suitable for a nested clustering structure. The method provided by {dynamicTreeCut} gives reasonable results.
for cluster number determination: will be added later.
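For illustration, a minimal sketch of the cophenetic-correlation comparison mentioned above, assuming the dtw package (which registers a "DTW" method with proxy) and dummy data standing in for the 126 series:
library(dtw)                                   # registers the "DTW" method with proxy
set.seed(1)
ts.mat <- matrix(rnorm(126 * 26), nrow = 126)  # dummy data: 126 series, 26 weeks
d.euc <- dist(ts.mat)                          # Euclidean distances
d.dtw <- as.dist(proxy::dist(ts.mat, method = "DTW"))  # DTW distances
hc.euc <- hclust(d.euc, method = "average")
hc.dtw <- hclust(d.dtw, method = "average")
cor(cophenetic(hc.euc), d.euc)                 # cophenetic correlation, Euclidean
cor(cophenetic(hc.dtw), d.dtw)                 # cophenetic correlation, DTW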
Cluster data for which you have class labels, and use the Rand index to measure cluster quality.
50 such datasets are available at the UCR time series archive.
This paper does something similar:
http://www.cs.ucr.edu/~eamonn/ClusteringTimeSeriesUsingUnsupervised-Shapelets.pdf
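A minimal sketch of that evaluation, using adjustedRandIndex from the mclust package and pam from the cluster package, with dummy data and labels standing in for a labelled UCR dataset:
library(mclust)                                   # for adjustedRandIndex()
library(cluster)                                  # for pam()
set.seed(1)
x           <- matrix(rnorm(60 * 26), nrow = 60)  # dummy data: 60 series, 26 points
true.labels <- rep(1:3, each = 20)                # known class labels
clustering  <- pam(dist(x), k = 3)$clustering     # PAM clustering, as in the question
adjustedRandIndex(true.labels, clustering)        # 1 = perfect agreement, ~0 = chance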

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. In other words, in each row the first column is n and the rest of the k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row there are many columns with a value of 0.
Here's a toy matrix I created:
mat = rbind(
  c(5,  0.1, 0.1,  0.1,  0.1,  0.1,  0.1, 0.1, 0.1, 0.1, 0.1),
  c(2,  0.2, 0.2,  0.2,  0.2,  0.2,  0,   0,   0,   0,   0),
  c(22, 0.4, 0.6,  0,    0,    0,    0,   0,   0,   0,   0),
  c(5,  0.5, 0.2,  0,    0.1,  0.2,  0,   0,   0,   0,   0),
  c(4,  0.4, 0.15, 0.15, 0.15, 0.15, 0,   0,   0,   0,   0),
  c(10, 0.6, 0.1,  0.1,  0.1,  0.1,  0,   0,   0,   0,   0)
)
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
where, say, samples = 100000.
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this becomes prohibitive, at least in terms of RAM. Is there a more efficient solution to this problem in R?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
vars = apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
# each column of `vars` holds the per-category variances for one row of `mat`
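As an aside, a minimal sketch of a simulation-free alternative: if each row's probabilities sum to 1, the multinomial count variances have the closed form Var(X_i) = n * p_i * (1 - p_i):
n <- mat[, 1]              # counts per row
p <- mat[, -1]             # category probabilities per row (assumed to sum to 1)
vars <- n * p * (1 - p)    # one variance per category per row, no simulation needed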

Is it possible to specify a range for numbers randomly generated by mvrnorm( ) in R?

I am trying to generate a random set of numbers that exactly mirrors a data set that I have (in order to test it). The dataset consists of 5 variables that are all correlated, with different means and standard deviations as well as ranges (they are Likert scales added together to form one variable). I have been able to get mvrnorm from the MASS package to create a dataset that replicates the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and standard deviations through a z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose score I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample match those of the original dataset.
A simple way to achieve this is with resampling:
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5)       # some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)  # bootstrap row indices
my.sample <- my.data[my.ind, ]                                    # resampled dataset
This will ensure that the margins and the dependence structure of the sample (closely) match those of the original data.
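A quick check of that claim, comparing the correlation matrices of the original data and of the resample:
round(cor(my.data), 2)      # dependence structure of the original data
round(cor(my.sample), 2)    # dependence structure of the bootstrap resample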
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as stated by @dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicitly) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.
