I have a set of data containing:
item, associated cluster, silhouette coefficient. I can further augment this data set with more information if necessary.
I would like to generate a silhouette plot in R. I'm having trouble because the examples I've come across use a built-in clustering function such as kmeans() and plot its result. I want to bypass that step and produce the plot for my own clustering algorithm, but I'm falling short on supplying the correct arguments to the plot function.
Thank you.
EDIT
Data set example https://pastebin.mozilla.org/8853427
What I've tried is loading the dataset and passing it to the plot function using various arguments based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html
The silhouette() function in the cluster package can produce the plot for you. It just needs a vector of cluster memberships (produced by whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used to produce the clusters). For example:
library(cluster)
library(vegan)

data(varespec)
dis <- vegdist(varespec)                # Bray-Curtis dissimilarity
res <- pam(dis, 3)                      # or whatever your choice of clustering algorithm is
sil <- silhouette(res$clustering, dis)  # or use your own cluster vector
windows()  # RStudio sometimes does not display silhouette plots correctly
plot(sil)
EDIT: For k-means (which uses squared Euclidean distance)
library(vegan)
library(cluster)

data(varespec)
dis <- dist(varespec)^2    # k-means implicitly uses squared Euclidean distance
res <- kmeans(varespec, 3)
sil <- silhouette(res$cluster, dis)
windows()
plot(sil)
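If the cluster labels come from your own algorithm, as in the original question, you can skip the model-fitting step entirely: silhouette() only needs an integer membership vector plus a dissimilarity. A minimal sketch, using a toy data matrix and a hand-made label vector as stand-ins for your own data and algorithm output:

```r
library(cluster)

# toy stand-in: two well-separated 2-D blobs
set.seed(1)
dat <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
labels <- rep(1:2, each = 20)   # integer cluster memberships from "your" algorithm

dis <- dist(dat)                # same dissimilarity you clustered on
sil <- silhouette(labels, dis)
summary(sil)                    # average silhouette width per cluster
plot(sil)
```

The key point is that silhouette() never sees the clustering model itself, only the label vector and the dissimilarity object.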
Related
I use MCLUST and specifically request K = 3 clusters with covariance matrix type VII.
library(mclust)
mc <- Mclust(iris[,-5], G = 2)
How can I create a figure like the one below? It's from my textbook, Applied Multivariate Statistical Analysis by Johnson and Wichern. Notice that the figure shows 2 clusters (squares and triangles) in each panel, so the textbook appears to have a mistake here: it used 2 clusters.
If you would like to vary the plotting shape by cluster assignment, you can do so through the pch argument. Using your data:
pairs(mc$data, pch = mc$classification)
If you want to change the shapes, you can map the classification assignment to the desired shape.
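For instance, to map each class to a specific symbol of your choosing, index a vector of pch codes by the classification. A sketch using the question's K = 3, model "VII" setup (the particular symbol codes 15, 17 and 19 are an arbitrary choice):

```r
library(mclust)

# fit with 3 clusters and spherical, varying-volume covariance ("VII")
mc <- Mclust(iris[, -5], G = 3, modelNames = "VII")

# map classification 1, 2, 3 to filled square, triangle, circle
shapes <- c(15, 17, 19)[mc$classification]
pairs(mc$data, pch = shapes)
```

Any vector of pch codes works, as long as it has one entry per class.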
I’m trying to use multidimensional scaling (MDS) in R.
Can I predict new values on test set based on the values that I receive from my training set?
I’m looking for something similar to what I’ve done in PCA for example:
prin_comp <- prcomp(pca.train, scale. = FALSE)
test.data <- predict(prin_comp, newdata = pca.test)
Thank you,
Ittai
You can use MDS as the first step of a three-step process:

1. Generate the MDS coordinates.
2. Apply a traditional clustering algorithm to the generated coordinates, e.g. k-means via kmeans(x, K), where you supply K, the number of clusters. Note that you will probably want to compute some quality metrics for the resulting clusters via cross-validation, to make sure they provide good labels for your existing data.
3. Use the k-means clusters to find the nearest centroid/cluster for each new data point.

Then you have a decision to make (as the modeler): do you apply the mode of the chosen cluster as the label for your new data? That is the simplest solution, but there are other approaches.
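A minimal sketch of those three steps, using cmdscale() for the MDS step and a hypothetical train/new split of the iris data (the split, K = 3, and the nearest-neighbour placement of new points in MDS space are all illustrative choices, not from the question; a true out-of-sample MDS extension is more involved):

```r
set.seed(42)
train   <- as.matrix(iris[1:120, -5])
newdata <- as.matrix(iris[121:150, -5])

# Step 1: MDS coordinates for the training data
coords <- cmdscale(dist(train), k = 2)

# Step 2: cluster the MDS coordinates
km <- kmeans(coords, centers = 3, nstart = 25)

# Step 3: place each new point in MDS space (crudely: borrow the
# coordinates of its nearest training neighbour), then assign it
# to the nearest k-means centroid
nearest_train <- apply(newdata, 1, function(p)
  which.min(colSums((t(train) - p)^2)))
new_coords <- coords[nearest_train, , drop = FALSE]

new_labels <- apply(new_coords, 1, function(p)
  which.min(colSums((t(km$centers) - p)^2)))
```

The nearest-centroid rule in step 3 is exactly what kmeans itself uses to assign points, so new_labels is directly comparable to km$cluster.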
In addition to what you wrote, couldn't I use the predict function, based on the coefficients from the training model, with the test data to predict new MDS values?
I have conducted an NMDS analysis and have plotted the output. However, I am unsure how to actually report the results from R. Which parts of the following output are most important? The plot also shows two clear groups; how are you supposed to describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that stress = 0, which would mean a perfect fit, and that there is still no convergence. This happens if you have six or fewer observations for two dimensions, or if you have degenerate data. You should not use NMDS in these cases. Current versions of vegan issue a warning with near-zero stress; perhaps you have an outdated version.
I think the best presentation is just a plot of the ordination scores. You can use the plot() and text() functions provided by the vegan package. Here I am creating a ggplot2 version (to get the legend gracefully):
library(vegan)
library(ggplot2)

data(dune)
ord <- metaMDS(comm = dune)

ord_spec  <- scores(ord, "species")
ord_spec  <- cbind.data.frame(ord_spec,  label = rownames(ord_spec))
ord_sites <- scores(ord, "sites")
ord_sites <- cbind.data.frame(ord_sites, label = rownames(ord_sites))

ggplot(data = ord_spec, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = label, col = "species")) +
  geom_text(data = ord_sites, aes(label = label, col = "sites"))
I need to generate random multidimensional clustered data. For this I want to generate a few uniformly distributed multidimensional points (centres) and then many normally distributed points around each of them. How can I set a vector (a multidimensional point) as the mean of the normal distribution? I see that rnorm can take vectors as its mean and sd parameters, but I don't really understand how that works.
Package mnormt, function rmnorm()
set.seed(2)
require(mnormt)

varcov <- matrix(rchisq(4, 2), 2)
varcov <- varcov + t(varcov)   # make the covariance matrix symmetric
rmnorm(1000, mean = c(0, 1), varcov = varcov)
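Putting rmnorm() to work on the original question (uniform centres, normal points around each), here is a sketch; the 2 dimensions, 5 clusters, [0, 10] range and spherical covariance are all arbitrary choices:

```r
library(mnormt)
set.seed(1)

k <- 5     # number of clusters
n <- 100   # points per cluster
d <- 2     # dimensions

# uniformly distributed centres in [0, 10]^d
centers <- matrix(runif(k * d, 0, 10), nrow = k)

# normal points around each centre (spherical covariance here)
pts <- do.call(rbind, lapply(1:k, function(i)
  rmnorm(n, mean = centers[i, ], varcov = 0.25 * diag(d))))
labels <- rep(1:k, each = n)

plot(pts, col = labels, pch = 19)
```

Each row of centers is passed as the mean vector of one multivariate normal, which is exactly the "vector as mean" behaviour the question was after.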
Function rmvnorm.mixt() in package ks is another good alternative. If you install the package and open its vignette (ks: Kernel density estimation for bivariate data), you will find an example of building a 'dumbbell' density with this function (see page 1). But you can also use the rmnorm() function (already proposed here) to build the same density, as follows:
# the mixture weights 4/11, 4/11 and 3/11 set how many points come
# from each component (they are not coordinate scalings): 550 draws
# split as 200, 200 and 150
xy <- rbind(rmnorm(200, c(-2, 2), diag(2)),
            rmnorm(200, c(2, -2), diag(2)),
            rmnorm(150, c(0, 0), matrix(c(0.8, -0.72, -0.72, 0.8), 2, 2)))
plot(xy)
I want to perform variable clustering using the varclus() function from Hmisc package.
However, I do not know how to put the clusters of variables into a table if I cut the dendrogram into 10 clusters of variables.
I used to use
groups <- cutree(hclust(d), k=10)
to cut dendrograms of individuals but it doesn't work for variables.
Expanding on @Anatoliy's comment, you indeed use the same cutree() function as before, because the clustering in varclus() is actually done by the hclust() function.
When you use varclus() you're creating an object of class varclus that contains a hclust object - which can be referenced by using $hclust.
Example:
x <- varclus(d)
x_hclust <- x$hclust ## retrieve hclust object
groups <- cutree(x_hclust, 10)
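If by "table" you mean one row per variable with its cluster id, a minimal sketch (using a plain hclust() on transposed data as a stand-in for x$hclust, since the cutree() step is identical either way):

```r
# stand-in for varclus(): cluster the columns of a numeric matrix
set.seed(1)
m <- matrix(rnorm(200), ncol = 10,
            dimnames = list(NULL, paste0("v", 1:10)))
hc <- hclust(dist(t(m)))        # plays the role of x$hclust
groups <- cutree(hc, k = 3)     # named vector: variable -> cluster id

# table layout: one row per variable
tab <- data.frame(variable = names(groups), cluster = groups,
                  row.names = NULL)

# or one list element per cluster
split(names(groups), groups)
```

With your real data you would replace hc by x$hclust and k = 3 by k = 10; split() is handy when the clusters have unequal sizes and a rectangular table would need padding.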