How to get the summary statistics and do a K-means cluster analysis in R for row matrices in a loop

My primary objective is to analyze a video by splitting it into frames of RGB images and then converting each RGB image into separate color matrices. I did that in MATLAB and imported the result into R for analysis. I have 125 frames, each with 3 color matrices (R, G and B). I would like to create a loop to get the summary statistics of each color matrix and also run a K-means cluster analysis on each of the matrices.
I did the summary statistics manually for each matrix, which was fine for this video. However, I have several more videos, so I need a way to loop it. I tried the cluster analysis on the red channel of frame 1 using the packages factoextra and cluster:
df <- Frame1[["redChannel"]]
fviz_nbclust(df, kmeans, method = "wss")
km <- kmeans(df, centers = 3, nstart = 25)
View(km)
kmRed1 <- kmeans(df, centers = 4, nstart = 25)
View(kmRed1)
kmRed1
K-means clustering with 4 clusters of sizes 4, 886, 8, 14
One of the clusters has 886 data points and the others have only a handful each. I don't understand what this means. I would also like to loop this over all the other matrices. When I try to plot the clusters using this code:
fviz_cluster(km, data = df)
I am getting this error message:
Error in prcomp.default(data, scale = FALSE, center = FALSE) :
cannot rescale a constant/zero column to unit variance
I am new to R and would appreciate any suggestions.
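One way the loop could look is sketched below, using only base R. It assumes the frames are held in a named list where each frame is itself a list of three numeric matrices; the names greenChannel and blueChannel, the helper analyse_channel and the simulated frames are hypothetical stand-ins for the imported MATLAB data. Note that kmeans() treats each row of a matrix as one observation, so the cluster sizes 4, 886, 8, 14 count image rows, not pixels. The prcomp error suggests that some columns are constant, so the sketch drops zero-variance columns before clustering.

```r
set.seed(42)

# Hypothetical stand-in for the imported data: each frame is a list of
# three numeric matrices (only redChannel is a name taken from the question).
make_frame <- function() {
  red <- matrix(sample(0:255, 60 * 40, replace = TRUE), nrow = 60)
  red[, 1] <- 0  # a constant column, like the one prcomp complained about
  list(
    redChannel   = red,
    greenChannel = matrix(sample(0:255, 60 * 40, replace = TRUE), nrow = 60),
    blueChannel  = matrix(sample(0:255, 60 * 40, replace = TRUE), nrow = 60)
  )
}
frames <- replicate(5, make_frame(), simplify = FALSE)
names(frames) <- paste0("Frame", seq_along(frames))

# Summary statistics over all pixel values plus a row-wise k-means,
# after dropping columns with zero variance.
analyse_channel <- function(m, k = 3) {
  m <- as.matrix(m)
  keep <- apply(m, 2, var) > 0
  list(
    stats = summary(as.vector(m)),
    km    = kmeans(m[, keep, drop = FALSE], centers = k, nstart = 25)
  )
}

# Outer lapply walks the frames, inner lapply walks the three channels.
results <- lapply(frames, function(f) lapply(f, analyse_channel))

results$Frame1$redChannel$stats    # summary of the red channel of frame 1
results$Frame1$redChannel$km$size  # cluster sizes (number of image rows)
```

The nested list keeps the frame and channel names, so any single result can be looked up as results$FrameN$channelName. Passing the same reduced matrix to fviz_cluster() should avoid the constant-column error, although that is not tested here.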

Related

Hierarchical clustering for centers of kmeans in R

I have a huge data set (200,000 rows * 40 columns) where each row represents an observation and each column is a variable. I would like to do hierarchical clustering on this data. Unfortunately, because the number of rows is so large, I cannot do this on my computer: I would need to compute the distance matrix for all pairs of observations, which is a 200,000 * 200,000 matrix.
The answer to this question suggests first using kmeans to calculate a number of centers, and then performing the hierarchical clustering on the coordinates of these centers using the library FactoMineR.
The problem: I keep getting an error when applying the same method!
# Example
library(FactoMineR)  # provides HCPC() and plot.HCPC()
# Data
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")
kClust_MyData <- kmeans(MyData, 1000, iter.max=20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph=FALSE, nb.clust=-1)
plot.HCPC(Hclust_MyData, choice="tree")
But I get this error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) :
object 'data.clust' not found
The package fastcluster has a method hclust.vector that does not require a distance matrix as input, but computes the distances itself in a more memory efficient way. From the fastcluster manual:
The call
hclust.vector(X, method='single', metric=[...])
is equivalent to
hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast
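The two-step idea from the question (k-means first, then hierarchical clustering on the centers) can also be sketched with base R alone, without FactoMineR or fastcluster. This is only an illustration under assumptions: the data are a smaller simulated version of the question's example, and the 100 centers, the ward.D2 linkage and the cut into 2 groups are arbitrary choices.

```r
set.seed(1)

# Smaller simulated version of the question's data: two Gaussian blobs.
MyData <- rbind(matrix(rnorm(7000, sd = 0.3), ncol = 2),
                matrix(rnorm(7000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")

# Step 1: compress the observations into 100 k-means centers.
km <- kmeans(MyData, centers = 100, iter.max = 50)

# Step 2: hierarchical clustering on the centers only, so the distance
# matrix is 100 x 100 instead of n x n.
hc <- hclust(dist(km$centers), method = "ward.D2")
center_groups <- cutree(hc, k = 2)

# Step 3: every observation inherits the group of its k-means center.
point_groups <- unname(center_groups[km$cluster])
table(point_groups)
```

The distance matrix in step 2 grows with the number of centers rather than the number of rows, which is what makes the approach feasible for large data.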

Draw heatmaps of each k-means cluster?

I have a large dataset composed of numerical observations. For this data set, I ran k-means with 6 clusters. How can I draw a heatmap of each cluster? When I try the following, I get an error:
clusters <- kmeans(dataset, 6)
heatmap(clusters$cluster)
You can subset the data by cluster and draw the heatmaps in a for loop. kmeans() returns a list whose cluster element is an integer vector giving the cluster assigned to each row of the dataset. heatmap() fails on that vector because it expects a numeric matrix.
library(pheatmap)
k <- 6
clusters <- kmeans(dataset, centers = k)
for (i in seq_len(k)) {
  # rows of dataset assigned to cluster i
  pheatmap(dataset[clusters$cluster == i, ])
}
Here, with k = 6, you get 6 heatmaps, one for each cluster of observations in your dataset.

Cluster data using medoids (cluster centers) in R

I have a data frame with three features:
library(cluster)
df <- data.frame(f1=rnorm(480,30,1),
f2=rnorm(480,40,0.5),
f3=rnorm(480,50, 2))
Now, I want to do clustering using K-medoids in two steps. In step 1, using some data from df I want to get medoids (cluster centers), and in step 2, I want to use obtained medoids to do clustering on remaining data. Accordingly,
# find medoids using some data
sample_data <- df[1:240,]
sample_data <- scale(sample_data) # scaling features
clus_res1 <- pam(sample_data,k = 4,diss=FALSE)
# Now perform clustering using medoids obtained from above clustering
test_data <- df[241:480,]
test_data <- scale(test_data)
clus_res2 <- pam(test_data,k = 4,diss=FALSE,medoids=clus_res1$medoids)
With this script, I get the following error message:
Error in pam(test_data, k = 4, diss = FALSE, medoids = clus_res1$medoids) :
'medoids' must be NULL or vector of 4 distinct indices in {1,2, .., n}, n=240
It is clear that the error message is due to the input format of the medoid matrix. How can I convert this matrix to the vector specified in the error message?
The initial medoids parameter expects index numbers of points in your data set. So c(42, 17) means to use objects 42 and 17 as initial medoids.
By the definition of medoids, you can only use points of your data set as medoids, not other vectors!
Clustering is unsupervised. There is no need to split your data into training and test sets, because there are no labels to overfit to in unsupervised learning.
Notice that in PAM each cluster center is an observation; that is, you get 4 observations, each of which is the center of a cluster.
So if you want to reuse the same centers, you need to find the observations in the new data that are closest to the observations that were the centers in your training data.
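The answers above can be combined into a rough sketch: take the medoids found on the first half, find the closest distinct rows in the second half, and pass their row indices to pam(). The names train, test and init_idx are hypothetical, scaling is left out to keep the example short, and the greedy nearest-row search is just one simple way to guarantee distinct indices.

```r
library(cluster)
set.seed(1)

df <- data.frame(f1 = rnorm(480, 30, 1),
                 f2 = rnorm(480, 40, 0.5),
                 f3 = rnorm(480, 50, 2))
train <- as.matrix(df[1:240, ])
test  <- as.matrix(df[241:480, ])

# Step 1: medoids from the first half (each medoid is an actual row of train).
clus_res1 <- pam(train, k = 4)

# Step 2: for each training medoid, pick the nearest row of test that has
# not been picked yet, so the indices are distinct as pam() requires.
init_idx <- integer(0)
for (j in seq_len(nrow(clus_res1$medoids))) {
  d <- colSums((t(test) - clus_res1$medoids[j, ])^2)  # squared distances
  d[init_idx] <- Inf                                  # exclude used rows
  init_idx <- c(init_idx, unname(which.min(d)))
}

# Step 3: the medoids argument takes row indices into test, not coordinates.
clus_res2 <- pam(test, k = 4, medoids = init_idx)
clus_res2$id.med  # final medoid indices after the swap phase
```

Because pam() still runs its swap phase by default, the final medoids in id.med may differ from the initial indices supplied.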

hclust() in R on large datasets

I am trying to implement hierarchical clustering in R with hclust(). This requires a distance matrix created by dist(), but my dataset has around a million rows, and even EC2 instances run out of RAM. Is there a workaround?
One possible solution for this is to sample your data, cluster the smaller sample, then treat the clustered sample as training data for k Nearest Neighbors and "classify" the rest of the data. Here is a quick example with 1.1M rows. I use a sample of 5000 points. The original data is not well-separated, but with only 1/220 of the data, the sample is separated. Since your question referred to hclust, I used that. But you could use other clustering algorithms like dbscan or mean shift.
## Generate data
set.seed(2017)
x = c(rnorm(250000, 0,0.9), rnorm(350000, 4,1), rnorm(500000, -5,1.1))
y = c(rnorm(250000, 0,0.9), rnorm(350000, 5.5,1), rnorm(500000, 5,1.1))
XY = data.frame(x,y)
Sample5K = sample(length(x), 5000) ## Downsample
## Cluster the sample
DM5K = dist(XY[Sample5K,])
HC5K = hclust(DM5K, method="single")
Groups = cutree(HC5K, 8)
Groups[Groups>4] = 4
plot(XY[Sample5K,], pch=20, col=rainbow(4, alpha=c(0.2,0.2,0.2,1))[Groups])
Now just assign all other points to the nearest cluster.
Core = which(Groups<4)
library(class)
knnClust = knn(XY[Sample5K[Core], ], XY, Groups[Core])
plot(XY, pch=20, col=rainbow(3, alpha=0.1)[knnClust])
A few quick notes.
Because I created the data, I knew to choose three clusters. With a real problem, you would have to do the work of figuring out an appropriate number of clusters.
Sampling 1/220 could completely miss any small clusters. In the small sample, they would just look like noise.

Hierarchical Cluster using dissimilarity matrix R

I have a mixed-type data matrix, Data_String, of size 947 x 41, that contains numeric and categorical attributes.
I produced a distance matrix (947 x 947) using the daisy() function with the Gower distance measure in RStudio.
d <- daisy(Data_String, metric = "gower", stand = FALSE,type = list(symm = c("V1","V13") , asymm = c("V8","V9","V10")))
I applied hierarchical clustering to the dissimilarity matrix (d).
# hclust
hc <- hclust(d, method="complete")
plot(hc)
rect.hclust(hc, 4)
cut <- cutree(hc, k = 1:5)
View(cut)
#Diana
d_as <- as.matrix(d)
DianaCluster <- diana(d_as, diss = TRUE, keep.diss = TRUE)
print(DianaCluster)
plot(DianaCluster)
These are the plots I obtained.
** Note: I couldn't upload the images here since I do not have enough reputation points.
I am struggling to understand the results. Can anyone please
1- suggest a solution I can apply in R to make my results easier to understand,
or
2- explain how I can link the results to my source data, since they are all based on the dissimilarity matrix?
Please take a look at:
https://stats.stackexchange.com/questions/130974/how-to-use-both-binary-and-continuous-variables-together-in-clustering
It explains how to use a Gower dissimilarity matrix with hclust. Hope this helps!
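On the second point of the question, one possible way to link the clusters back to the source data is to attach the cutree() labels as a new column and summarize each variable per cluster. The sketch below uses a small simulated data frame called toy (a hypothetical stand-in for Data_String), and the choice of 2 clusters is arbitrary.

```r
library(cluster)
set.seed(123)

# Hypothetical mixed-type data: two numeric columns and one factor.
toy <- data.frame(
  income = c(rnorm(20, 30, 3), rnorm(20, 60, 5)),
  age    = c(rnorm(20, 25, 2), rnorm(20, 50, 4)),
  owner  = factor(rep(c("no", "yes"), each = 20))
)

# Gower dissimilarity and complete-linkage clustering, as in the question.
d  <- daisy(toy, metric = "gower")
hc <- hclust(d, method = "complete")

# cutree() returns one label per row, in the original row order,
# so it can be attached directly to the source data.
toy$cluster <- cutree(hc, k = 2)

# Describe each cluster in terms of the original variables.
aggregate(cbind(income, age) ~ cluster, data = toy, FUN = mean)
table(cluster = toy$cluster, owner = toy$owner)
```

The per-cluster means and the cross-tabulation show which original attributes distinguish the groups, which is usually easier to read than the dendrogram alone.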
