Different results between fpc::dbscan and dbscan::dbscan - r

I want to implement DBSCAN in R on some GPS coordinates. I have a distance matrix (dis_matrix) that I fed into the following functions:
dbscan::dbscan(dis_matrix, eps=50, minPts = 5,borderPoints=TRUE)
fpc::dbscan(dis_matrix,eps = 50,MinPts = 5,method = "dist")
and I'm getting very different results from the two functions, both in the number of clusters and in whether a point is noise or belongs to a cluster. Basically, the results are inconsistent between the two implementations. I have no clue why they generate such different results, although here
http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
we see for iris data, both functions did the same.
My distance matrix comes from a function (geosphere::distm) that calculates the spatial distance between more than 2000 coordinates.
Furthermore, I coded DBSCAN according to this pseudo-code
source: https://cse.buffalo.edu/~jing/cse601/fa13/materials/clustering_density.pdf
My results are equal to what I obtained from the fpc package.
Can anyone see why they are different? I already looked into both functions and haven't found anything.

The documentation of geosphere::distm says that it does not return a dist object but a matrix. When given a plain matrix, dbscan::dbscan assumes that it is a data matrix and not distances. Convert your matrix into a dist object with as.dist first, e.g. as.dist(dis_matrix). This should resolve the problem.

Related

Understanding R subspace package clustering output

Somewhat related to this question.
I am using R subspace package for subspace clustering. As in the question above, I have failed to use the generic plotting method to plot out my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that the "object" is the row number for the clustered unit, and the subspace indicates the dimensions of the data in which the clustering was done. However I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot out all the clusters with something like ggplot2, or in 3D, what have you. And I need to know which units are in which clusters in corresponding dimensions.
Additionally, I checked whether any two members of the output list have the same dimensions, like this:
cluster_result <- clique_model
equalities_matrix <- matrix(
  0L, nrow = length(cluster_result), ncol = length(cluster_result)
)
for (i in 1:length(cluster_result)) {
  for (j in 1:length(cluster_result)) {
    equalities_matrix[i, j] <- (
      all(cluster_result[[i]]$subspace == cluster_result[[j]]$subspace)
    )
  }
}
sum(equalities_matrix)
The answer is no.
So, here is what my research led to; it might be helpful for someone in the future.
Above, the CLIQUE algorithm outputted one cluster per set of dimensions, possibly by virtue of the data or the tuning of the algorithm. I added more features to the data, ran it again, and again checked whether any two members of the output list have the same dimensions. This time, yes: several sets of dimensions were the same, as sum(equalities_matrix) yielded a number larger than the number of features.
In conclusion, the output of the algorithm is a list of lists where each member list represents one cluster in one subspace:
subspace ... the dimensions making the subspace indicated with TRUE,
objects ... members of the cluster.
If there is more than one cluster in a given subspace, there will be more member lists with the same subspace, and different members of the cluster.
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu; 2004
Agrawal, Gehrke, Gunopulos, and Raghavan

PCA : eigen values vs eigen vectors vs loadings in python vs R?

I am trying to calculate PCA loadings of a dataset. The more I read about it, the more I get confused because "loadings" is used differently at many places.
I am using sklearn.decomposition in Python for PCA, as well as R (using the FactoMineR and factoextra libraries), as it provides easy visualization techniques. The following is my understanding:
pca.components_ give us the eigen vectors. They give us the directions of maximum variation.
pca.explained_variance_ give us the eigen values associated with the eigen vectors.
eigenvectors * sqrt(eigen values) = loadings which tell us how principal components (pc's) load the variables.
Now, what I am confused by is:
Many forums say that eigen vectors are the loadings. Then, when we multiply the eigen vectors by the sqrt(eigen values) we just get the strength of association. Others say eigenvectors * sqrt(eigen values) = loadings.
Eigen vectors squared tells us the contribution of variable to pc? I believe this is equivalent to var$contrib in R.
loading squared (eigen vector or eigenvector*sqrt(eigenvalue) I don't know which one) shows how well a pc captures a variable (closer to 1 = variable better explained by a pc). Is this equivalent of var$cos2 in R? If not what is cos2 in R?
Basically I want to know how to understand how well a principal component captures a variable and what is the contribution of a variable to a pc. I think they both are different.
What is pca.singular_values_? It is not clear from the documentation.
The first and second links I referred to, which contain R code with explanations, and the Stats Stack Exchange forum are what confused me.
Okay, after much research and going through many papers I have the following:
1. pca.components_ = eigen vectors. Take a transpose so that pc's are columns and variables are rows.
1.a: eigenvector**2 = variable contribution in principal components. If it's close to 1 then a particular pc is well explained by that variable.
In python -> pca.components_.T ** 2 [Multiply by 100 if you want percentages and not proportions] [R equivalent -> var$contrib]
2. pca.explained_variance_ = eigen values
3. pca.singular_values_ = singular values obtained from SVD.
(singular values)**2/(n-1) = eigen values
4. eigen vectors * sqrt(eigen values) = loadings matrix
4.a: vertical sum of squared loading matrix = eigen values. (Given you have taken the transpose as explained in step 1)
4.b: horizontal sum of squared loading matrix = variable's variance explained by all principal components, i.e. how much of a variable's variance all pc's retain after transformation. (Given you have taken the transpose as explained in step 1)
In python -> loading matrix = pca.components_.T * np.sqrt(pca.explained_variance_).
For questions pertaining to r:
var$cos2 = var$cor^2 (the squared correlation between a variable and a principal component). Given the coordinates of the variables on a factor map, it shows how well a variable is represented by a particular principal component.
var$contrib = Summarized by point 1. In R: (var.cos2 * 100) / (total cos2 of the component). See the PCA analysis in R link.
Hope it helps others who are confused by PCA analysis.
Huge thanks to -- https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another
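The relationships listed in this answer can be checked numerically. The following is a minimal NumPy-only sketch, not the sklearn implementation: the toy data are made up for illustration, and the names `components`, `singular_values`, and `eigenvalues` merely mirror what sklearn exposes as pca.components_, pca.singular_values_, and pca.explained_variance_.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 3 correlated variables.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing
n = X.shape[0]
Xc = X - X.mean(axis=0)  # PCA operates on centered data

# SVD of the centered data. The rows of Vt are the eigenvectors of the
# covariance matrix (the role played by pca.components_).
U, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt
eigenvalues = singular_values ** 2 / (n - 1)  # role of pca.explained_variance_

# Loadings: eigenvectors scaled by sqrt(eigenvalue).
# After the transpose, variables are rows and PCs are columns.
loadings = components.T * np.sqrt(eigenvalues)

# The eigenvalues agree with those of the sample covariance matrix.
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(eigenvalues, cov_eigenvalues))

# 4.a: column sums of the squared loadings give the eigenvalues.
print(np.allclose((loadings ** 2).sum(axis=0), eigenvalues))

# 4.b: row sums of the squared loadings give each variable's variance
# (this holds when all PCs are kept).
print(np.allclose((loadings ** 2).sum(axis=1), Xc.var(axis=0, ddof=1)))

# 1.a: squared eigenvector entries sum to 1 within each PC, so they can be
# read as the proportion each variable contributes to that PC.
contrib = components.T ** 2
print(np.allclose(contrib.sum(axis=0), 1.0))
```

If fewer components are kept than there are variables, the row sums in 4.b give only the part of each variable's variance that the retained components capture.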

R and SPSS: Different results for Hierarchical Cluster Analysis

I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all are 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)
dist <- dist(datA, method = "euclidean")
hc <- hclust(d = dist, method = "ward.D2")
table(cutree(hc, k = 2))
The resulting cluster sizes are:
        1    2
SPSS  712  288
R     610  390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible when comparing the dendrograms; this also applies to the 3-10 cluster solutions). "ward.D2" takes into account the squared distance, if I'm not mistaken, so I included the simple distance matrix here. However, I tried several combinations of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when applying the same method). Finally, I excluded duplicate cases (N=29) from my data, guessing that those might have caused differences by being allocated (randomly) at a certain point. None of this resulted in matching outputs in R and SPSS.
I tried running the analysis with the agnes() function from the cluster package, which resulted in - again - different results compared to SPSS and even hclust() (But that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is having tied distances handled differently in two algorithms. Tied distances might be present with the original full distance matrix, or might occur during the joining process. If one program searches the matrix and finds two or more distances tied at the minimum value at that step, and it selects the first one, while another program selects the last one, or one or both select one at random from among the ties, different results could occur.
I'd suggest starting with a small example with some data with randomness added to values to make tied distances unlikely and see if the two programs produce matching results on those data. If not, there's a deeper problem. If so, then tie handling might be the issue.
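Before comparing the two programs, it can help to check whether tied distances are present at all. The following is a minimal Python sketch using only the standard library; the simulated Likert responses are hypothetical stand-ins (not the asker's data), and the helper names are made up for illustration. It counts how many pairwise squared Euclidean distances share their value with another pair, before and after adding a little random noise.

```python
import itertools
import random
from collections import Counter

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def count_tied_pairs(rows):
    """Number of pairs whose distance is shared with at least one other pair."""
    counts = Counter(
        squared_euclidean(a, b) for a, b in itertools.combinations(rows, 2)
    )
    return sum(c for c in counts.values() if c > 1)

random.seed(1)
# Hypothetical stand-in for the survey: 40 respondents, 6 five-point items.
likert = [[random.randint(1, 5) for _ in range(6)] for _ in range(40)]
# The same data with a small amount of random noise added to every value.
jittered = [[v + random.uniform(-0.01, 0.01) for v in row] for row in likert]

ties_raw = count_tied_pairs(likert)
ties_jittered = count_tied_pairs(jittered)
print("tied pairs, raw data:     ", ties_raw)
print("tied pairs, jittered data:", ties_jittered)
```

With integer Likert items only a few distinct distance values are possible, so ties among the pairs are unavoidable; on jittered data they should all but disappear, which is what makes the jittered comparison between programs informative.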

convert a list -class numeric- into a distance structure in R

I have a list that looks like this. It is a measure of dispersion for each sample.
1 2 3 4 5
0.11829384 0.24987017 0.08082147 0.13355495 0.12933790
To analyze this further I need it to be a distance structure; the vegan package needs it as a 'dist' object.
I found some solutions that convert matrices to dist, but how could I change this current data into a dist object?
I am using the FD package. In the manual I found:
Still, one potential advantage of FDis over Rao’s Q is that in the unweighted case
(i.e. with presence-absence data), it opens possibilities for formal statistical tests for differences in
FD between two or more communities through a distance-based test for homogeneity of multivariate
dispersions (Anderson 2006); see betadisper for more details
I wanted to use vegan betadisper function to test if there are differences among different regions (I provided this using element "region" with column "region" too)
functional <- FD(trait, comun)
mod <- betadisper(functional$FDis, region$region)
Using gowdis or fdisp from FD didn't work either.
distancias <- gowdis(rasgo)
mod <- betadisper(distancias, region$region)
dispersion <- fdisp(distancias, presence)
mod <- betadisper(dispersion, region$region)
I tried this but I need a list object. I thought I could pass those results to betadisper.
You cannot do this: FD::fdisp() does not return dissimilarities. It returns a list of three elements: the dispersions FDis for each sampling unit (SU), and the results of the eigen decomposition of input dissimilarities (eig for eigenvalues, vectors for orthonormal eigenvectors). The FDis values are summarized for each original SU, but there is no information on the differences among SUs. The eigen decomposition can be used to reconstruct the original input dissimilarities (your distancias from FD::gowdis()), but you can directly use the input dissimilarities. Function FD::gowdis() returns a regular "dist" structure that you can directly use in vegan::betadisper() if that gives you a meaningful analysis. For this, your grouping variable must be based on the same units as your distancias. In typical application of fdisp, the units are species (taxa), but it seems you want to get analysis for communities/sites/whatever. This will not be possible with these tools.

DBSCAN Clustering with additional features

Can I apply DBSCAN with other features in addition to location? And if so, how can it be done in R or Spark?
I tried preparing an R table of 3 columns: latitude, longitude, and score (the feature I want to cluster on in addition to the spatial features). When I ran DBSCAN with the following R code, I got the following plot, which suggests that the algorithm makes clusters on each pair of columns (long, lat), (long, score), (lat, score), ...
my R Code:
df = read.table("/home/ahmedelgamal/Desktop/preparedData")
var = dbscan(df, eps = .013)
plot(x = var, data = df)
and the plot I get:
You are misinterpreting the plot.
You don't get one result per plot, but all plots show the same clusters, only in different attributes.
But you also have the issue that the R version is (to my knowledge) only fast for Euclidean distance.
In your current code, points are neighbors if (lat[i]-lat[j])^2+(lon[i]-lon[j])^2+(score[i]-score[j])^2 <= eps^2. This is bad because: 1. latitude and longitude are not Euclidean, so you should be using haversine instead; 2. your additional attribute has a much larger scale, and thus you pretty much only cluster points with near-zero score; and 3. your score attribute is skewed.
For this problem you should probably be using Generalized DBSCAN. Points are similar if their haversine distance is less than e.g. 1 mile (you want to measure geographic distance here, not coordinates, because of distortion) and if their score differs by a factor of at most 1.1 (i.e. compare score[y] / score[x] or work in logspace?). Since you want both conditions to hold, the usual Euclidean DBSCAN implementation is not enough; you need a Generalized DBSCAN that allows multiple conditions. Look for an implementation of Generalized DBSCAN instead (I believe there is one in ELKI that you may be able to access from Spark), or implement it yourself. It's not very hard to do.
If quadratic runtime is okay for you, you can probably use any distance-matrix-based DBSCAN, and simply "hack" a binary distance matrix:
compute Haversine distances
compute Score dissimilarity
distance = 0 if haversine < distance-threshold and score-dissimilarity < score-threshold, otherwise 1.
run DBSCAN with precomputed distance matrix and eps=0.5 (since it is a binary matrix, don't change eps!)
It's reasonably fast, but needs O(n^2) memory. In my experience, the indexes of ELKI yield a good speedup if you have larger data, and are worth a try if you run out of memory or time.
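The binary distance matrix steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not ELKI's Generalized DBSCAN: the sample points, scores, thresholds (1 km, ratio 1.1, min_pts of 3), and all helper names are hypothetical, scores are assumed to be positive, and the small DBSCAN routine is a textbook version written for illustration. Any DBSCAN that accepts a precomputed distance matrix could take its place.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def binary_distances(points, scores, max_km, max_ratio):
    """0 where both conditions hold (close in space, similar score), else 1."""
    n = len(points)
    dist = [[1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            near = haversine_km(points[i], points[j]) < max_km
            ratio = max(scores[i], scores[j]) / min(scores[i], scores[j])
            if near and ratio < max_ratio:
                dist[i][j] = 0
    return dist

def dbscan_precomputed(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical data: five nearby points (one with a very different score)
# and one point far away.
points = [(30.000, 31.000), (30.001, 31.001), (30.002, 31.000),
          (30.001, 30.999), (30.000, 31.002), (40.000, 20.000)]
scores = [10.0, 10.5, 10.2, 50.0, 10.8, 10.0]

dist = binary_distances(points, scores, max_km=1.0, max_ratio=1.1)
labels = dbscan_precomputed(dist, eps=0.5, min_pts=3)  # eps stays at 0.5
print(labels)
```

In this toy data the fourth point is spatially close to the cluster but fails the score condition, so it ends up as noise together with the distant point; both thresholds have to hold for two points to count as neighbors.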
You need to scale your data. V3 has a range which is much larger than the range for the V1 and V2 and thus DBSCAN currently mostly ignores V3.
