First off, I apologize if this is a stupid question. I am a medical doctor and never studied mathematics/statistics, so all I know is self-taught over the course of my PhD.
To analyze my data I'm using a daisy()-produced dissimilarity matrix as the source for kmeans() clustering and visualising the result with clusplot(). The clusplot reports how much point variability the two visualised components explain taken together, but not separately. Is there a way to show the % of point variability for each component separately?
Do I understand correctly that the daisy output does not have principal components per se, but that clusplot essentially runs PCA on it and uses the first two components? If so, can it list all the components with the % of explained variability for each?
Thanks a lot!
Example code:
library(cluster)  # daisy() and clusplot()

gower_dist <- daisy(data, metric = "gower")
fit <- kmeans(gower_dist, 3, nstart = 20)
attr(gower_dist, "Labels") <- data[, 5]
clusplot(as.matrix(gower_dist), fit$cluster, diss = TRUE, color = TRUE, shade = FALSE,
         labels = 3, lines = 0, plotchar = FALSE, stand = TRUE, span = TRUE)
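For anyone looking at the same issue: when clusplot() is given a dissimilarity (diss = TRUE), it projects the points with classical multidimensional scaling (cmdscale) rather than a PCA of raw variables, so a per-axis breakdown of "explained point variability" can be read off the cmdscale eigenvalues. A minimal sketch along those lines, reusing gower_dist from above (the exact percentages clusplot prints may be computed slightly differently):

# Classical MDS on the Gower dissimilarities, keeping the eigenvalues
mds <- cmdscale(gower_dist, k = 2, eig = TRUE)

# Share of point variability attributable to each axis (positive eigenvalues only)
ev <- mds$eig[mds$eig > 0]
round(100 * ev / sum(ev), 2)  # percentage per component; the first two correspond to the clusplot axes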
I am working with R.
I am calculating a hierarchical cluster and plotting it. I then cut it into cluster-groups to plot again.
I have a for-loop to do this on subsets of a database, which works fine.
The problem is that each subset of data might have a different optimal number of clusters...
The solutions I've found online for finding the optimal number of clusters are visual.
Is there code I can run to automatically determine the optimal number of clusters? In the code example, I'm looking for "noOfClusters". Also, it should be a maximum of 10...
This is what my clustering looks like, in short:
library(cluster)  # agnes()

clusterResult <- agnes(singleLinkMatrix, stand = FALSE, method = "ward", metric = "euclidean")
plot(clusterResult)
clusterMember <- cutree(clusterResult, k = noOfClusters)
Thanks a lot :)
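In case it helps later readers: one non-visual way to pick "noOfClusters" is to cut the tree at every candidate k and keep the k with the highest average silhouette width. A rough sketch under that idea, capped at 10 clusters as requested; it assumes singleLinkMatrix is the raw data matrix, so Euclidean distances are recomputed to match metric = "euclidean" (adjust if your object is already a dissimilarity):

library(cluster)

d <- dist(singleLinkMatrix)  # Euclidean distances, matching the metric passed to agnes
avg_sil <- sapply(2:10, function(k) {
  members <- cutree(clusterResult, k = k)
  mean(silhouette(members, d)[, "sil_width"])
})
noOfClusters <- (2:10)[which.max(avg_sil)]  # k with the best average silhouette width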
My problem may seem trivial to most of you. I'm working on hierarchical clustering with Ward's method on my data, and I would like to identify the optimal number of clusters. The plot (produced by the code below) shows the hierarchical clustering obtained from an optimal matching distance. But what is the optimal number of clusters in this case? How can I determine this?
Sample code:
library(TraMineR)  # seqcost(), seqdist()
library(cluster)   # agnes()

costs <- seqcost(df_new.seq, method = "TRATE")
df_new.seq.om <- seqdist(df_new.seq, method = "OM", sm = costs$sm, indel = costs$indel)

######################### cluster ward ###########################
clusterward <- agnes(df_new.seq.om, diss = TRUE, method = "ward")
dev.new()
plot(clusterward, which.plots = 2)
cl1.4 <- cutree(clusterward, k = 10)
cl1.4fac <- factor(cl1.4, labels = paste("Cluster", 1:10))
While this question is over a year old at this point and the poster has hopefully decided on their clusters, for anyone finding this post and wondering the same thing (how do I best decide on the optimal number of clusters when doing sequence analysis?), I highly recommend this paper on cluster validation. I've found it very useful! It comes with a step-by-step example.
Studer, M. (2021). Validating Sequence Analysis Typologies Using Parametric Bootstrap. Sociological Methodology, 51(2), 290–318. https://doi.org/10.1177/00811750211014232
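As a quick numerical complement to that paper, the WeightedCluster package (by the same author) can compute cluster-quality indices for a whole range of k on the OM distances. A minimal sketch, assuming the clusterward and df_new.seq.om objects from the question:

library(WeightedCluster)

# Cluster-quality statistics (ASW, PBC, HC, ...) for 2 to 10 clusters
cr <- as.clustrange(clusterward, diss = df_new.seq.om, ncluster = 10)
cr$stats  # table of quality indices per number of clusters
plot(cr)  # visual comparison across k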
I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighted clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response to a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestions. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I make a mistake with the R commands?
Thanks in advance.
Two things that answer my question (as well as a comment on weighting variables for k-means, or, more precisely, for hard competitive learning):
The weights are for observations (= rows of x), not variables (= columns of x), so using hardcl for weighting variables is wrong.
In hardcl or neural gas you need many more iterations than in standard k-means: one k-means iteration uses the complete data set to update the centroids, whereas hard competitive learning uses only a single observation per iteration. So, compared to k-means, multiply the number of iterations by your sample size.
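In code terms, a rough sketch of that rule of thumb (the factor of 500 mirrors the original iter.max; df is the data frame from the question):

# Hard competitive learning updates the centroids one observation at a time,
# so scale iter.max by the number of rows (a rule of thumb, not an exact bound)
fc_control@iter.max <- 500 * nrow(df)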
I have conducted an NMDS analysis and have plotted the output too. However, I am unsure how to actually report the results from R. Which parts of the following output are most important? The graph that is produced also shows two clear groups; how are you supposed to describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that stress = 0, which means the fit is essentially perfect, and that there is still no convergence. This happens if you have six or fewer observations for two dimensions, or if you have degenerate data. You should not use NMDS in these cases. Current versions of vegan issue a warning when stress is nearly zero; perhaps you had an outdated version.
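As a quick way to check this on your own objects (a minimal sketch; dgge2 and MDS.out are the objects from the question):

nrow(dgge2)    # with <= 6 observations, a 2-dimensional NMDS can always reach ~zero stress
MDS.out$stress # stress of the best solution; values at or near 0 are the red flag described above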
I think the best way to present the result is just a plot of the ordination (the two NMDS axes). You can use the plot and text methods provided by the vegan package. Here I am creating a ggplot2 version (to get the legend gracefully):
library(vegan)
library(ggplot2)

data(dune)
ord <- metaMDS(comm = dune)

# Species and site scores, with row names kept as labels
ord_spec <- scores(ord, display = "species")
ord_spec <- cbind.data.frame(ord_spec, label = rownames(ord_spec))
ord_sites <- scores(ord, display = "sites")
ord_sites <- cbind.data.frame(ord_sites, label = rownames(ord_sites))

ggplot(data = ord_spec, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = label, col = "species")) +
  geom_text(data = ord_sites, aes(label = label, col = "sites"))
I have 150 experimental substances. 80 characteristics were measured for each of these substances separately. I applied PCA and retained the first three principal components. Now, I want to apply k-means clustering in R (www.R-project.org) with 1000 iterations on this low-dimensional data to separate the individuals into their respective populations.
Can anyone see how this can be done? Thanks
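For reference, a minimal base-R sketch of that workflow (names are illustrative: X stands for the 150 x 80 measurement matrix, and 3 populations is a placeholder):

# PCA on the scaled measurements, keeping the first three components
pca     <- prcomp(X, center = TRUE, scale. = TRUE)
scores3 <- pca$x[, 1:3]

# k-means on the low-dimensional scores, 1000 iterations as requested
set.seed(1)
km <- kmeans(scores3, centers = 3, iter.max = 1000, nstart = 25)
table(km$cluster)  # cluster sizes; km$cluster gives each substance's assignment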
See adegenet package and try DAPC.
Please read http://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-11-94. I think it does what you wish. It is implemented in the adegenet R package as DAPC. This implementation is designed for multilocus genotype data, but the principle is very well described, so you can modify it for your own data or find something similar.
It performs k-means clustering on PC-transformed ("cleared") data, which significantly speeds up the calculations. Finally, it performs discriminant analysis to get the best clustering. It is a very efficient method.
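A minimal sketch of what that looks like with adegenet on a plain numeric matrix (assuming mat is the 150 x 80 data matrix; n.pca and the number of clusters are illustrative, and find.clusters can instead choose k from its BIC curve):

library(adegenet)

# k-means on PCA-transformed data; n.clust is fixed here to keep the example non-interactive
grp <- find.clusters(mat, n.pca = 3, n.clust = 3)

# Discriminant analysis of principal components on the inferred groups
dd <- dapc(mat, grp$grp, n.pca = 3, n.da = 2)
scatter(dd)  # plot the discriminant space with cluster membership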
http://www.statmethods.net/advstats/cluster.html provides nice and easy examples for clustering data.
For your question:
Consider some random normal data and some simple code to fit a k-means clustering. Note that 3 clusters will be fit to this data (purely arbitrarily).
set.seed(42)                                        # for reproducibility
data <- matrix(rnorm(450), ncol = 3)                # 150 rows x 3 columns of N(0, 1) noise
fit  <- kmeans(data, centers = 3, iter.max = 1000)  # k-means with 3 clusters
cluster.data <- data.frame(data, fit$cluster)       # attach cluster membership to the data
Has this answered your question?