R - warning for dissimilarity calculation, clustering with numeric matrix - r

Reproducible data:
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
Convert dataframe to matrix
Matrix_from_Data <- data.matrix(Data)
Check the structure
str(Matrix_from_Data)
num [1:10, 1:3] 1 0 0 1 0 1 0 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "X" "Y" "Z"
The question:
I have dataframe of binary, symmetric variables (larger than the example), and I'd like to do some hierarchical clustering, which I've never tried before. There are no missing or NA values.
I convert the dataframe into a matrix before attempting to run the daisy function from the 'cluster' package, to get the dissimilarity matrix. I'd like to explore the options for calculating different dissimilarity metrics, but am running into a warning (not an error):
library(cluster)
Dissim_Euc_Matrix_from_Data <- daisy(Matrix_from_Data, metric = "euclidean", type = list(symm =c(1:ncol(Matrix_from_Data))))
Warning message:
In daisy(Matrix_from_Data, metric = "euclidean", type = list(symm = c(1:ncol(Matrix_from_Data)))) :
with mixed variables, metric "gower" is used automatically
...which seems weird to me, since "Matrix_from_Data" is all numeric variables, not mixed variables. Gower might be a fine metric, but I'd like to see how the others impact the clustering.
What am I missing?

Great question.
First, that message is a Warning and not an Error. I'm not personally familiar with daisy, but my ignorant guess is that that particular warning message pops up when you run the function and doesn't do any work to see if the warning is relevant.
Regardless of why that warning appears, one simple way to compare the clustering done by several different distances measures in hierarchical clustering is to plot the dendograms. For simplicity, let's compare the "euclidean" and "binary" distance metrics programmed into dist. You can use ?dist to read up on what the "binary" distance means here.
# When generating random data, always set a seed if you want your data to be reproducible
set.seed(1)
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
# Create distance matrices
mat_euc <- dist(Data, method="euclidean")
mat_bin <- dist(Data, method="binary")
# Plot the dendograms side-by-side
par(mfrow=c(1,2))
plot(hclust(mat_euc))
plot(hclust(mat_bin))
I generally read dendograms from the bottom-up since points lower on the vertical axis are more similar (i.e. less distant) to one another than points higher on the vertical axis.
We can pick up a few things from these plots:
4/6, 5/10, and 7/8 are grouped together using both metrics. We should hope this is true if the rows are identical :)
3 is most strongly associated with 7/8 for both distance metrics, although the degree of association is a bit stronger in the binary distance as opposed to the Euclidean distance.
1, 2, and 9 have some notably different relationships between the two distance metrics (e.g. 1 is most strongly associated with 2 in Euclidean distance but with 9 in binary distance). It is in situations like this where the choice of distance metric can have a significant impact on the resulting clusters. At this point it pays to go back to your data and understand why there are differences between the distance metrics for these three points.
Also remember that there are different methods of hierarchical clustering (e.g. complete linkage and single linkage), but you can use this same approach to compare the differences between methods as well. See ?hclust for a complete list of methods provided by hclust.
Hope that helps!

Related

Mclust() - NAs in model selection

I recently tried to perform a GMM in R on a multivariate matrix (400 obs of 196 var), which elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were well classified, whereas with k-means the result reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
gmm <- Mclust(X, G = 5) #I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
ii = which(cl_gmm == k) # individuals of group k
counts=table(y[ii]) # number of occurences for each label
imax = which.max(counts) # Majority label
maj_lab = attributes(counts)$dimnames[[1]][imax]
print(paste("Group ",k,", majority label = ",maj_lab))
cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every other model than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
Until now I did not find any solution to this problem...are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot apply every model when two much explenatory variables are involved. The right thing to do is first reduce dimensions and select an optimal number of dimensions that make it possible to properly apply GMMs while preserving as much informations as possible about the data.

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster them, based on how frequent each pair of items are ordered together. I calculated for each pair of items, number of frequency of occurence within the same order. I.e the highest number of occurrence was 489 between two items. I then "calculated" the similarity/correlation, by: Frequency / "max frequency of all pairs" (489). Now I have the dataset that I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried with something called "Jaccard’s coefficient/index", but get almost same results.
The dataset: The dataset contains material numbers V1 and V2. and N is the correlation between the two material numbers between 0 - 1.
With help from another one, I managed to create a distance matrix and use the PAM clustering.
Why PAM clustering? A data scientist suggest this: You have more than 95% of pairs without information, this makes all these materials are at the same distance and a single cluster very dispersed. This problem can be solved using a PAM algorithm, but still you will have a very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think for clustering I need the 696x696 full matrix, even though a lot of them are zeros. But i'm not sure.
Problem 2: Clustering does not do very well. I get very concentrated clusters. A lot of items are clustered in the first cluster. Also, according to how you verify PAM clusters, my clustering results are poor. Is it due to the similarity analysis? What else should I use? Is it due to the 95% of data being zeros? Should I change the zeros to something else?
The whole code and results:
#Suppose X is the dataset
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
I'm not sure where you get the 696 number from. After you rbind, you have a dataframe with 567 unique values for V1 and V2, and then you perform the dcast, and end up with a matrix as expected 567 x 567. Clustering wise I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
#Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis. Then from an academic viewpoint, your current justification to why PAM is required, does not hold any merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of (continuous) variables in the dataset, V1, V2, N, I do not see the logic on why PAM is applicable here (like I mentioned in the comments, PAM works best for mixed variables).
Continuing further, See this post on correlation detection in R;
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor<- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data<- my_data[,-removeHighCor]
[1] 32 4
Hope this helps.

Louvain community detection in R using igraph - format of edges and vertices

I have a correlation matrix of scores that I would like to run community detection on using the Louvain method in igraph, in R. I converted the correlation matrix to a distance matrix using cor2dist, as below:
distancematrix <- cor2dist(correlationmatrix)
This gives a 400 x 400 matrix of distances from 0-2. I then made the list of edges (the distances) and vertices (each of the 400 individuals) using the below method from http://kateto.net/networks-r-igraph (section 3.1).
library(igraph)
test <- as.matrix(distancematrix)
mode(test) <- "numeric"
test2 <- graph.adjacency(test, mode = "undirected", weighted = TRUE, diag = TRUE)
E(test2)$weight
get.edgelist(test2)
From this I then wrote csv files of the 'from' and 'to' edge list, and corresponding weights:
edgeweights <-E(test2)$weight
write.csv(edgeweights, file = "edgeweights.csv")
fromtolist <- get.edgelist(test2)
write.csv(fromtolist, file = "fromtolist.csv")
From these two files I produced a .csv file called "nodes.csv" which simply had all the vertex IDs for the 400 individuals:
id
1
2
3
4
...
400
And a .csv file called "edges.csv", which detailed 'from' and 'to' between each node, and provided the weight (i.e. the distance measure) for each of these edges:
from to weight
1 2 0.99
1 3 1.20
1 4 1.48
...
399 400 0.70
I then tried to use this node and edge list to create an igraph object, and run louvain clustering in the following way:
nodes <- read.csv("nodes.csv", header = TRUE, as.is = TRUE)
edges <- read.csv("edges.csv", header = TRUE, as.is = TRUE)
clustergraph <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
clusterlouvain <- cluster_louvain(clustergraph)
Unfortunately this did not do the louvain community detection correctly. I expected this to return around 2-4 different communities, which could be plotted similarly to here, but sizes(clusterlouvain) returned:
Community sizes
1
400
indicating that all individuals were sorted into the same community. The clustering also ran immediately (i.e. with almost no computation time), which also makes me think it was not working correctly.
My question is: Can anyone suggest why the cluster_louvain method did not work and identified just one community? I think I must be specifying the distance matrix or edges/nodes incorrectly, or in some other way not giving the correct input to the cluster_louvain method. I am relatively new to R so would be very grateful for any advice. I have successfully used other methods of community detection on the same distance matrix (i.e. k-means) which identified 2-3 communities, but would like to understand what I have done wrong here.
I'm aware there are multiple other queries about using igraph in R, but I have not found one which explicitly specifies the input format of the edges and nodes (from a correlation matrix) to get the louvain community detection working correctly.
Thank you for any advice! I can provide further information if helpful.
I believe that cluster_louvain did exactly what it should do with your data.
The problem is your graph.Your code included the line get.edgelist(test2). That must produce a lot of output. Instead try, this
vcount(test2)
ecount(test2)
Since you say that your correlation matrix is 400x400, I expect that you will
get that vcount gives 400 and ecount gives 79800 = 400 * 399 / 2. As you have
constructed it, every node is directly connected to all other nodes. Of course there is only one big community.
I suspect that what you are trying to do is group variables that are correlated.
If the correlation is near zero, the variables should be unconnected. What seems less clear is what to do with variables with correlation near -1. Do you want them to be connected or not? We can do it either way.
You do not provide any data, so I will illustrate with the Ionosphere data from
the mlbench package. I will try to mimic your code pretty closely, but will
change a few variable names. Also, for my purposes, it makes no sense to write
the edges to a file and then read them back again, so I will just directly
use the edges that are constructed.
First, assuming that you want variables with correlation near -1 to be connected.
library(igraph)
library(mlbench) # for Ionosphere data
library(psych) # for cor2dist
data(Ionosphere)
correlationmatrix = cor(Ionosphere[, which(sapply(Ionosphere, class) == 'numeric')])
distancematrix <- cor2dist(correlationmatrix)
DM1 <- as.matrix(distancematrix)
## Zero out connections where there is low (absolute) correlation
## Keeps connection for cor ~ -1
## You may wish to choose a different threshhold
DM1[abs(correlationmatrix) < 0.33] = 0
G1 <- graph.adjacency(DM1, mode = "undirected", weighted = TRUE, diag = TRUE)
vcount(G1)
[1] 32
ecount(G1)
[1] 140
Not a fully connected graph! Now let's find the communities.
clusterlouvain <- cluster_louvain(G1)
plot(G1, vertex.color=rainbow(3, alpha=0.6)[clusterlouvain$membership])
If instead, you do not want variables with negative correlation to be connected,
just get rid of the absolute value above. This should be much less connected
DM2 <- as.matrix(distancematrix)
## Zero out connections where there is low correlation
DM2[correlationmatrix < 0.33] = 0
G2 <- graph.adjacency(DM2, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain <- cluster_louvain(G2)
plot(G2, vertex.color=rainbow(4, alpha=0.6)[clusterlouvain$membership])

Cluster data using medoids (cluster centers) in R

I have a dataframe with three features as
library(cluster)
df <- data.frame(f1=rnorm(480,30,1),
f2=rnorm(480,40,0.5),
f3=rnorm(480,50, 2))
Now, I want to do clustering using K-medoids in two steps. In step 1, using some data from df I want to get medoids (cluster centers), and in step 2, I want to use obtained medoids to do clustering on remaining data. Accordingly,
# find medoids using some data
sample_data <- df[1:240,]
sample_data <- scale(sample_data) # scaling features
clus_res1 <- pam(sample_data,k = 4,diss=FALSE)
# Now perform clustering using medoids obtained from above clustering
test_data <- df[241:480,]
test_data <- scale(test_data)
clus_res2 <- pam(test_data,k = 4,diss=FALSE,medoids=clus_res1$medoids)
With this script, I get an error message as
Error in pam(test_data, k = 4, diss = FALSE, medoids = clus_res1$medoids) :
'medoids' must be NULL or vector of 4 distinct indices in {1,2, .., n}, n=240
It is clear that error message is due to the input format of Medoid matrix. How can I convert this matrix to the vector as specified in the error message?
The initial medoids parameter expects index numbers of points in your data set. So 42,17 means to use objects 42 and 17 as initial medoids.
By the definition of medoids, you can only use points of your data set as medoids, not other vectors!
Clustering is unsupervised. No need to split your data in training/test, because there are no labels to overfit to in unsupervised learning.
Notice that in PAM the clustering center is an observation, that is you get 4 observations that each of them is a center of cluster. Demonstration of PAM.
So if you want to try and use the same center, you need to find the observations which are closest to the observations who are the center in your train.

Kappa Statistic Extremely Large/Sparse matrix

I have a large sparseMatrix (mat):
138493 x 17694 sparse Matrix of class "dgCMatrix", with 10000132 entries
I want to investigate Inter-rating agreement using kappa statistics but when I run Fleiss:
kappam.fleiss(mat)
I am shown the following error
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Is this due to my matrix being too large?
Is there any other methods I can use to calculate kappa statistics for IRR on a matrix this large?
The best answer that I can offer is that this is not really possible due to the extreme sparsity in your matrix. The problem: With 10,000,132 entries for a 138,493 * 17694 = 2,450,495,142 cell matrix, you have mostly (99.59%) missing values. The irr package allows for these but here you are placing some extreme demands on the system, by asking it to compare ratings for users whose films do not overlap.
This is compounded by the problem that the methods in the irr package a) require dense matrixes as input, and b) (at least in kripp.alpha() loop over columns making them very slow.
Here is an illustration constructing a matrix similar in nature to yours (but with no pattern - in reality your situation will be better because viewers tend to rate similar sets of movies).
Note that I used Krippendorff's alpha here, since it allows for ordinal or interval ratings (as your data suggests), and normally handles missing data fine.
require(Matrix)
require(irr)
seed <- 100
(sparseness <- 1 - 10000132 / (138493 * 17694))
## [1] 0.9959191
138493 / 17694 # multiple of movies to users
## [1] 7.827117
# nraters <- 17694
# nusers <- 138493
nmovies <- 100
nusers <- 783
raterMatrix <-
Matrix(sample(c(NA, seq(0, 5, by = .5)), nmovies * nusers, replace = TRUE,
prob = c(sparseness, rep((1-sparseness)/11, 11))),
nrow = nmovies, ncol = nusers)
kripp.alpha(t(as.matrix(raterMatrix)), method = "interval")
## Krippendorff's alpha
##
## Subjects = 100
## Raters = 783
## alpha = -0.0237
This worked for that size matrix, but fails if I increase it 100x (10x on each dimension), keeping the same proportions as in your reported dataset, then it fails to produce an answer after even 30 minutes, so I killed the process.
What to conclude: You are not really asking the right question of this data. It's not an issue of how many users agreed, but probably what sort of dimensions exist in this data in terms of clusters of viewing and clusters of preferences. You probably want to use association rules or some dimensional reduction methods that don't balk at the sparsity in your dataset.

Resources