Finding best number of clusters knowing only dissimilarity matrix in R [duplicate] - r

This question already has answers here:
How to use dissimilarity matrix with NbClust in R
(2 answers)
Closed 5 years ago.
I have a dissimilarity matrix and I want to run hierarchical clustering using that matrix as the only input, as I don't know the source data itself. For background, I aim to cluster elements using their mutual correlation as a distance. Following the methodology indicated here, I'm using the correlation matrix to compute the dissimilarity matrix that is given to hclust as input. This is working fine.
My question is: how do I find the optimal number of clusters? Is there an index that can be computed by only knowing the dissimilarity matrix? The indices in NbClust require the source data to run - it is not enough to know the dissimilarity matrix. Is there any other method I can use in R?

A quick look at the NbClust documentation suggests it is possible to supply only the dissimilarity matrix and omit the original source data.
NbClust(data = NULL, diss = XYZ, distance = NULL, ...)
When the dissimilarity matrix (here referred to as XYZ) is supplied, data and distance must be set to NULL, as stated in the function's Usage section. NbClust should then be able to produce the partition indices you are after.
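A minimal sketch of such a call, assuming corr is the correlation matrix described in the question (the index and the min.nc/max.nc range are only illustrative; without the raw data, only indices that work from distances alone, such as the silhouette, can be computed):
library(NbClust)
d <- as.dist(1 - corr)                         # dissimilarity derived from the correlation matrix
res <- NbClust(data = NULL, diss = d, distance = NULL,
               min.nc = 2, max.nc = 10,
               method = "average", index = "silhouette")
res$Best.nc                                    # suggested number of clusters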

Related

PCA : eigen values vs eigen vectors vs loadings in python vs R?

I am trying to calculate the PCA loadings of a dataset. The more I read about it, the more confused I get, because "loadings" is used differently in many places.
I am using sklearn.decomposition in Python for the PCA analysis as well as R (using the FactoMineR and factoextra libraries), as R provides easy visualization techniques. The following is my understanding:
pca.components_ gives us the eigenvectors. They give us the directions of maximum variation.
pca.explained_variance_ gives us the eigenvalues associated with the eigenvectors.
eigenvectors * sqrt(eigenvalues) = loadings, which tell us how the principal components (PCs) load the variables.
Now, what I am confused by is:
Many forums say that the eigenvectors are the loadings, and that multiplying the eigenvectors by sqrt(eigenvalues) just gives the strength of association. Others say eigenvectors * sqrt(eigenvalues) = loadings.
Do the squared eigenvectors tell us the contribution of a variable to a PC? I believe this is equivalent to var$contrib in R.
Squared loadings (of the eigenvector or of eigenvector*sqrt(eigenvalue), I don't know which one) show how well a PC captures a variable (closer to 1 = variable better explained by that PC). Is this the equivalent of var$cos2 in R? If not, what is cos2 in R?
Basically I want to understand how well a principal component captures a variable and what the contribution of a variable to a PC is. I think the two are different.
What is pca.singular_values_? It is not clear from the documentation.
These are the first and second links I referred to, which contain R code with explanations, and the Stats StackExchange thread that confused me.
Okay, after much research and going through many papers, I have the following:
1. pca.components_ = eigenvectors. Take a transpose so that PCs are columns and variables are rows.
1.a: eigenvector**2 = contribution of a variable to the principal components. If it's close to 1, that particular PC is well explained by that variable.
In Python -> pow(pca.components_.T, 2) [multiply by 100 if you want percentages instead of proportions] [R equivalent -> var$contrib]
2. pca.explained_variance_ = eigenvalues
3. pca.singular_values_ = singular values obtained from the SVD.
(singular values)**2 / (n - 1) = eigenvalues
4. eigenvectors * sqrt(eigenvalues) = loadings matrix
4.a: vertical (column-wise) sum of the squared loadings matrix = eigenvalues (given you have taken the transpose as explained in step 1).
4.b: horizontal (row-wise) sum of the squared loadings matrix = a variable's variance explained by all principal components, i.e. how much of a variable's variance all PCs retain after the transformation (given you have taken the transpose as explained in step 1).
In Python -> loadings matrix = pca.components_.T * sqrt(pca.explained_variance_).
For questions pertaining to R:
var$cos2 = var$cor^2, the squared correlation between a variable and a component (for a standardized PCA this equals the squared coordinate of the variable on the factor map). It tells you how well a variable is represented by a particular principal component.
var$contrib = summarized by point 1. In R: (var.cos2 * 100) / (total cos2 of the component); see the "PCA analysis in R" link.
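As a sanity check, here is a minimal R sketch with prcomp that reproduces these relationships (it assumes a numeric data matrix X and a standardized PCA, so that it matches FactoMineR::PCA(X, scale.unit = TRUE)):
p <- prcomp(X, center = TRUE, scale. = TRUE)
eigvec <- p$rotation                                  # eigenvectors (variables in rows, PCs in columns)
eigval <- p$sdev^2                                    # eigenvalues
loadings <- sweep(eigvec, 2, p$sdev, "*")             # eigenvectors * sqrt(eigenvalues)
colSums(loadings^2)                                   # 4.a: equals the eigenvalues
rowSums(loadings^2)                                   # 4.b: variance of each variable retained by all PCs
cos2 <- loadings^2                                    # squared correlations, compare with var$cos2
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100   # compare with var$contrib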
Hope it helps others who are confused by PCA analysis.
Huge thanks to -- https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

Different results between fpc::dbscan and dbscan::dbscan

I want to implement DBSCAN in R on some GPS coordinates. I have a distance matrix (dis_matrix) that I fed into the following functions:
dbscan::dbscan(dis_matrix, eps=50, minPts = 5,borderPoints=TRUE)
fpc::dbscan(dis_matrix,eps = 50,MinPts = 5,method = "dist")
and I'm getting very different results from the two functions, both in the number of clusters and in whether a point is a noise point or belongs to a cluster. Basically, the results are inconsistent between the two calls. I have no clue why they generate such different results, although here
http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
we can see that both functions gave the same result for the iris data.
My distance matrix comes from a function (geosphere::distm) which calculates the spatial distance between more than 2000 coordinates.
Furthermore, I coded DBSCAN myself according to this pseudo-code
source: https://cse.buffalo.edu/~jing/cse601/fa13/materials/clustering_density.pdf
My results are equal to what I obtained from fpc package.
Can anyone see why they are different? I have already looked into both functions and haven't found anything.
The documentation of geosphere::distm says that it does not return a dist object but a plain matrix. dbscan::dbscan therefore assumes that you have passed a data matrix, not distances. Convert your matrix into a dist object with as.dist first. This should resolve the problem.
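A minimal sketch of the fix (assuming coords is a two-column matrix of longitude/latitude, and that eps = 50 is meant in metres, the unit distm returns):
library(geosphere)
dis_matrix <- distm(coords)                      # plain numeric matrix, not a "dist" object
# Wrong: each row of the matrix is treated as a point in a high-dimensional space
# dbscan::dbscan(dis_matrix, eps = 50, minPts = 5)
# Right: tell dbscan these are precomputed distances
res_dbscan <- dbscan::dbscan(as.dist(dis_matrix), eps = 50, minPts = 5)
# fpc interprets the matrix as distances only when method = "dist"
res_fpc <- fpc::dbscan(dis_matrix, eps = 50, MinPts = 5, method = "dist")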

Can I extract certain dissimilarity values from a bray curtis matrix in R?

I am trying to assess how long a change in stream communities persists following invasive species removals. Community composition was assessed in a series of streams before and at several time points after the removal. I would like to calculate dissimilarity between the community immediately before and at each time point following removal (i.e. between pre and post 1, pre and post 2, pre and post 3, etc.), then analyze these values across time to determine how long it takes for stream communities to return to pre-removal conditions. Differences in stream parameters are expected to influence the process, so I need to do this by stream. I'm picturing a plot with time on the x axis, dissimilarity on the y axis, and a different function for each stream. Is there a way to pull just these dissimilarity values out of a Bray-Curtis matrix?
This is my first question, I apologize for any missing information or lack of clarity.
If you convert the output from vegdist() to a matrix then you can easily pull out the dissimilarities between pairs of samples, using rownames or indices of the matrix to extract what you want.
To convert to a matrix use
distmat <- as.matrix(bc_obj)
where bc_obj is the object returned from vegdist() containing the Bray Curtis distances.
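For example, a minimal sketch (it assumes a community matrix comm whose row names encode stream and time point; the names used here, such as "streamA_pre", are purely hypothetical):
library(vegan)
bc_obj <- vegdist(comm, method = "bray")         # Bray-Curtis dissimilarities
distmat <- as.matrix(bc_obj)
# Dissimilarity between the pre-removal sample and each post-removal sample of one stream
distmat["streamA_pre", c("streamA_post1", "streamA_post2", "streamA_post3")]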

R: clustering with a similarity or dissimilarity matrix? And visualizing the results

I have a similarity matrix that I created using Harry (a tool for string similarity), and I wanted to plot some dendrograms from it to see if I could find some clusters / groups in the data. I'm using the following similarity measures:
Normalized compression distance (NCD)
Damerau-Levenshtein distance
Jaro-Winkler distance
Levenshtein distance
Optimal string alignment distance (OSA)
("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")
At first (it was my first time using R) I didn't pay much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and since my similarity matrix is normalized to [0,1], I know I could just do dissimilarity = 1 - similarity and then use hclust.
But the groups I get using hclust with the similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.
I tried the proxy package as well and the same problem happens: the groups I get aren't what I expected.
To get the dendrogram using the similarity matrix I do:
(1) plot(hclust(as.dist(similarityMATRIX), "average"))
With the dissimilarity matrix I tried:
(2) plot(hclust(as.dist(dissimilarityMATRIX), "average"))
and
(3) plot(hclust(as.sim(dissimilarityMATRIX), "average"))
From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups I can get out of it aren't as good as the ones I get from (1).
I'm saying the groups are good/bad because at the moment I have a fairly small volume of data to analyse, so I can check them easily.
Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?
You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package).
You can check if a dendrogram fits by using the dendextend R package function cor_cophenetic (use the most recent version from github).
Clustering which is based on distance can be done using hclust, but also using cluster::pam (k-medoids).
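A minimal sketch putting these pieces together (it assumes sim is the normalized [0,1] similarity matrix; k = 3 for pam is just an illustrative choice):
library(heatmaply)
library(dendextend)
library(cluster)
d_dis <- as.dist(1 - sim)                        # proper dissimilarity
hc <- hclust(d_dis, method = "average")
heatmaply(sim)                                   # heatmap view of the similarity matrix
# Cophenetic correlation: how faithfully the dendrogram preserves the dissimilarities
cor(as.vector(cophenetic(hc)), as.vector(d_dis))
# Compare against the tree built (incorrectly) from the raw similarities
hc_sim <- hclust(as.dist(sim), method = "average")
cor_cophenetic(as.dendrogram(hc), as.dendrogram(hc_sim))
# k-medoids directly on the dissimilarities
pam_fit <- pam(d_dis, k = 3)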

Does mat2listw function in R return a row-standardized spatial weight matrix?

In the discussion in the accepted answer in this question, user3050574 said that:
"... it is my understanding that mat2listw creates a row standardized weight matrix from a matrix that is currently just in binary form."
This is the only place where I have read this kind of claim. In the spdep R documentation, it is said that
"The function converts a square spatial weights matrix, optionally a sparse matrix to a weight list object, ..."
Does this conversion include row-standardizing?
I have a weight matrix with each element as the exact weight that I want to apply. Therefore it's crucial to me to be certain about whether the mat2listw function generates a row-standardized weight matrix or not.
This is puzzling me as well. I also have a weight matrix I want to apply to my estimations. spml allows either a matrix or a listw object for the weights, so I tried both and compared the results. It turned out that the estimation with the matrix itself and with the listw obtained via mat2listw deliver the same results (I think this supports the idea that mat2listw does not row-standardize by default).
However when I apply the impacts() function to my output, I get the following error: Error in impacts.splm(b1, listw = lw1) :
Only row-standardised weights supported
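For what it's worth, a minimal sketch of the two conversions (it assumes W is your numeric weights matrix; the style codes follow spdep's listw conventions, where "M" keeps the matrix as given and "W" row-standardizes):
library(spdep)
lw_raw <- mat2listw(W, style = "M")   # weights kept exactly as supplied
lw_std <- mat2listw(W, style = "W")   # row-standardized: each row of weights sums to 1
# Inspect the first observation's weights to see the difference
lw_raw$weights[[1]]
lw_std$weights[[1]]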
