How to create multiple scatter plots with the mclust package in R?

I use mclust and explicitly specify K=3 clusters, with covariance matrix type VII.
library(mclust)
mc <- Mclust(iris[,-5], G = 2)
How can I create a figure like the one below? It's from my textbook, Applied Multivariate Statistical Analysis by Johnson and Wichern. Notice that each panel of this figure shows only 2 clusters (squares and triangles), so the textbook has a mistake here: it used 2 clusters.

If you would like to modify the shape based on cluster assignment, you can do so through the use of pch. Using your data:
pairs(mc$data, pch = mc$classification)
If you want to change the shapes, you can map the classification assignment to the desired shape.
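As a minimal sketch of that mapping (assuming the two-cluster fit from the question, and picking open squares and open triangles to mimic the textbook figure; the `shapes` vector is an illustrative name, not part of mclust):

```r
library(mclust)

# Fit the two-component model from the question
mc <- Mclust(iris[, -5], G = 2)

# Look-up vector: cluster 1 -> pch 0 (open square), cluster 2 -> pch 2 (open triangle)
shapes <- c(0, 2)

# Index the look-up vector by the classification to get one symbol per observation
pairs(mc$data, pch = shapes[mc$classification])
```

With more clusters, extend `shapes` so it has one entry per cluster.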

Related

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio and have a problem reconciling the results of k-means cluster analysis with a hierarchical clustering plot. When I use the kmeans function, I get 4 groups with 10, 20, 30 and 6 observations. However, when I plot the dendrogram, I get 4 groups with different numbers of observations: 23, 26, 10 and 7.
Have you ever run into a problem like this?
Here is my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2,cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")
Results of k-means and hierarchical clustering do not need to be the same in every scenario.
Just to give an example, every time you run k-means the initial centroids are chosen at random, so the results can differ from run to run.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one tends to do several runs and then aggregate the results to check which are the 'core' clusters.
Hierarchical clustering is, in contrast, purely deterministic as there is no randomness involved. But like k-means, it is a heuristic: a set of greedy rules is followed to build clusters, with no guarantee of reaching the global optimum of an overall objective function (for example the intra- and inter-cluster variance vs overall variance). The rule used to merge existing clusters and individual observations (the "ward.D2" value you pass as method in the hclust command) is crucial in determining the size of the formed clusters.
Having a properly defined objective function to optimise should give you a unique answer (or set thereof) but the problem is NP-hard, because of the sheer size (as a function of the number of observations) of the partitioning involved. This is why only heuristics exist and also why any clustering procedure should not be seen as a tool giving definitive answers but as an exploratory one.
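To see how far the two partitions actually disagree, one option is to fix the seed, run k-means from several random starts, and cross-tabulate the assignments. This is only a sketch: it uses the built-in iris measurements as a stand-in for mydata0, and the seed and nstart values are arbitrary choices.

```r
set.seed(42)                       # make the random k-means starts reproducible
mydata <- scale(iris[, -5])        # stand-in for scale(mydata0)

# k-means with 25 random starts; the best solution (lowest within-cluster SS) is kept
fit <- kmeans(mydata, centers = 4, nstart = 25)

# Ward hierarchical clustering on the same scaled data, cut into 4 groups
fit2 <- hclust(dist(mydata, method = "euclidean"), method = "ward.D2")
groups <- cutree(fit2, k = 4)

# Rows: k-means clusters, columns: Ward clusters (labels are arbitrary in both)
table(kmeans = fit$cluster, ward = groups)
```

Cluster labels carry no meaning across methods, so look for rows and columns that share most of their observations rather than for matching numbers.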

Silhouette plot in R

I have a set of data containing:
item, associated cluster, silhouette coefficient. I can further augment this data set with more information if necessary.
I would like to generate a silhouette plot in R. I am having trouble with this because examples I came across use the built-in kmeans (or related) clustering function and plot the result. I want to bypass this step and produce the plot for my own clustering algorithm but I'm ending up short on providing the correct arguments to the plot function.
Thank you.
EDIT
Data set example https://pastebin.mozilla.org/8853427
What I've tried is loading the dataset and passing it to the plot function using various arguments based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html
The silhouette function in the cluster package can do the plots for you. It just needs a vector of cluster membership (produced by whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used to produce the clusters). For example:
library (cluster)
library (vegan)
data(varespec)
dis = vegdist(varespec)
res = pam(dis,3) # or whatever your choice of clustering algorithm is
sil = silhouette (res$clustering,dis) # or use your cluster vector
dev.new() # open a separate graphics device (portable, unlike windows()); RStudio sometimes does not display silhouette plots correctly
plot(sil)
EDIT: For k-means (which uses squared Euclidean distance)
library (vegan)
library (cluster)
data(varespec)
dis = dist(varespec)^2
res = kmeans(varespec,3)
sil = silhouette (res$cluster, dis)
dev.new() # portable replacement for windows()
plot(sil)
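If the cluster labels come from your own algorithm, the same call applies as long as they form an integer vector in the same order as the observations behind the dissimilarity matrix. A sketch under those assumptions, using iris as stand-in data and cutree() output as a placeholder for the custom labels (`my_clusters` is an illustrative name):

```r
library(cluster)

x <- scale(iris[, -5])             # stand-in data
d <- dist(x)                       # dissimilarities between observations

# Placeholder for labels from your own algorithm: one integer per observation
my_clusters <- cutree(hclust(d, method = "average"), k = 3)

sil <- silhouette(my_clusters, d)  # recomputes the widths from labels + dissimilarities
summary(sil)$avg.width             # overall average silhouette width
plot(sil)
```

Note that silhouette() recomputes the coefficients itself, so it needs the dissimilarities and not just precomputed widths.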

Differing color of points & line in plot.gamm()

I ran a GAMM model with a large dataset (over 20,000 cases) using mgcv. Because of the large number of data points, it is very difficult to see the smoothed lines among the residual points in the plot. Is it possible to specify different colors for the points and the smoothed fit lines?
Here is an example adapted from the mgcv documentation:
library(mgcv)
## simple examples using gamm as alternative to gam
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
b <- gamm(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b$gam, pages=1, residuals=T, col='#FF8000', shade=T, shade.col='gray90')
plot(absbmidiffLog.GAMM$gam, pages=1, residuals=T, pch=19, cex=0.01, scheme=1,
col='#FF8000', shade=T,shade.col='gray90')
I have looked into the visreg package, but it does not seem to work with gamm objects.
I also found it surprisingly difficult/impossible to choose 2 different colors.
A workaround for me is to add the points later, i.e. first call plot.gam with residuals=FALSE and then add the points with the base R points() call.
This only works properly though if you shift the gam plot to its proper mean. Here is the code for one of the terms. (Use a for loop to get all four on one page)
library(mgcv)
## simple examples using gamm as alternative to gam
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
b <- gamm(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b$gam, select=3, shift = coef(b$gam)[1], residuals=FALSE, col='#FF8000', shade=T, shade.col='gray90')
points(y~x3, data=dat,pch=20,cex=0.75,col=rgb(1,0.65,0,0.25))
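The for loop mentioned above could look roughly like this. It is a sketch that repeats the same workaround for each smooth; the 2 x 2 layout and the `xvars` helper vector are assumptions made for illustration.

```r
library(mgcv)

set.seed(0)
dat <- gamSim(1, n = 200, scale = 2)
b <- gamm(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

xvars <- c("x0", "x1", "x2", "x3")   # covariates, in the order of the smooth terms
op <- par(mfrow = c(2, 2))           # four panels on one page

for (i in seq_along(xvars)) {
  # Smooth i without residuals, shifted by the intercept as in the single-term call
  plot(b$gam, select = i, shift = coef(b$gam)[1], residuals = FALSE,
       col = "#FF8000", shade = TRUE, shade.col = "gray90")
  # Overlay the raw observations in a different, semi-transparent colour
  points(dat[[xvars[i]]], dat$y, pch = 20, cex = 0.75,
         col = rgb(1, 0.65, 0, 0.25))
}

par(op)                              # restore the previous graphics settings
```

The axis limits are set by plot.gam from the smooth, so observations outside that range are not shown unless you pass a wider ylim.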

NMDS ordination interpretation from R output

I have conducted an NMDS analysis and have plotted the output too. However, I am unsure how to actually report the results from R. Which parts from the following output are of most importance? The graph that is produced also shows two clear groups, how are you supposed to describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that stress=0, which means the fit is perfect, and that there is still no convergence. This happens if you have six or fewer observations for two dimensions, or if you have degenerate data. You should not use NMDS in these cases. Current versions of vegan will issue a warning with near-zero stress. Perhaps you had an outdated version.
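A quick way to check these points on a fitted object is to look at the number of observations, the stress value and the Shepard diagram. The sketch below uses vegan's dune data as a stand-in for dgge2; the seed is arbitrary.

```r
library(vegan)

data(dune)                         # stand-in community data
set.seed(1)                        # metaMDS uses random starts
ord <- metaMDS(dune, distance = "bray", trace = FALSE)

nrow(dune)                         # number of observations going into the ordination
ord$stress                         # a value at or near zero is a warning sign
stressplot(ord)                    # Shepard diagram: ordination distance vs dissimilarity
```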
I think the best interpretation is just a plot of the ordination scores. You can use the plot and text methods provided by the vegan package. Here I am creating a ggplot2 version (to get the legend gracefully):
library(vegan)
library(ggplot2)
data(dune)
ord <- metaMDS(comm = dune)
# species and site scores as data frames, keeping the row names as labels
ord_spec <- scores(ord, display = "species")
ord_spec <- cbind.data.frame(ord_spec, label = rownames(ord_spec))
ord_sites <- scores(ord, display = "sites")
ord_sites <- cbind.data.frame(ord_sites, label = rownames(ord_sites))
ggplot(data = ord_spec, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = label, col = 'species')) +
  geom_text(data = ord_sites, aes(label = label, col = 'sites'))

Heatmap or plot for a correlation matrix [duplicate]

I tried to plot a correlation matrix with the lattice library, using three colours to represent the correlation coefficients.
library(lattice)
levelplot(cor)
I obtain the following plot:
The plot is only for a subset of the data I had. When I use the whole dataset (400 x 400), the plot becomes unclear: the colouring is not shown properly and appears as dots. Is it possible to obtain the same plot in tile form for a large matrix?
I tried using the pheatmap function, but I do not want my values to be clustered; I just want a clear representation of high and low values in tile form.
If you want to do a correlation plot, use the corrplot library, as it has a lot of flexibility to create heatmap-like figures for correlations:
library(corrplot)
#create data with some correlation structure
jnk=runif(1000)
jnk=(jnk*100)+c(1:500, 500:1)
jnk=matrix(jnk,nrow=100,ncol=10)
jnk=as.data.frame(jnk)
names(jnk)=c("var1", "var2","var3","var4","var5","var6","var7","var8","var9","var10")
#create correlation matrix
cor_jnk=cor(jnk, use="complete.obs")
#plot cor matrix
corrplot(cor_jnk, order="AOE", method="circle", tl.pos="lt", type="upper",
         tl.col="black", tl.cex=0.6, tl.srt=45,
         addCoef.col="black", addCoefasPercent = TRUE,
         p.mat = 1-abs(cor_jnk), sig.level=0.50, insig = "blank")
The code above only adds color to the correlations whose absolute value is above 0.5, but you can easily change that. Lastly, there are many ways that you can configure the look of the plot as well (change the color gradient, display of correlations, display of full vs only half matrix, etc.). The order argument is particularly useful as it allows you to order your variables in the correlation matrix based on PCA, so they are ordered based on similarities in correlation.
For squares, for instance (similar to your original plot), just change the method to "square".
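A sketch of the square variant, rebuilding the same simulated data as above (the seed is an addition so the figure is reproducible):

```r
library(corrplot)

set.seed(1)
# Same construction as above: 100 rows x 10 variables with some correlation structure
jnk <- runif(1000) * 100 + c(1:500, 500:1)
jnk <- as.data.frame(matrix(jnk, nrow = 100, ncol = 10))
names(jnk) <- paste0("var", 1:10)

cor_jnk <- cor(jnk, use = "complete.obs")

# method = "square" draws one filled square per coefficient instead of a circle
corrplot(cor_jnk, method = "square", order = "AOE", type = "upper",
         tl.col = "black", tl.cex = 0.6, tl.srt = 45)
```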
EDIT: @Carson. You can still use this method for reasonably large correlation matrices, for instance a 100-variable matrix. Beyond that, I fail to see the use of making a graphical representation of a correlation matrix with so many variables without some subsetting, as that will be very hard to interpret.
@Lucas provides good advice here, as corrplot is quite useful for visualizing correlation matrices. However, it doesn't address the original issue of plotting a large correlation matrix. In fact, corrplot will also fail when trying to visualize a correlation matrix this large. For a simple solution, you might want to consider reducing the number of variables. That is, I would suggest looking at the correlation between a subset of variables that you know are important for your problem. Trying to understand the correlation structure of that many variables will be a difficult task (even if you can visualize it)!
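If a plain, unclustered tile view of the full matrix is still wanted, one possibility not suggested in the answers above is base R's image() with explicit breaks. This is a rough sketch on simulated data: the 400 variables, the three colours and the cut points at -0.5 and 0.5 are all assumptions chosen for illustration.

```r
set.seed(1)
# Simulated stand-in: 100 observations of 400 variables
x <- matrix(rnorm(100 * 400), nrow = 100, ncol = 400)
cm <- cor(x)                        # 400 x 400 correlation matrix
n <- ncol(cm)

# Three colour classes: strongly negative, weak, strongly positive
image(1:n, 1:n, cm[, n:1],          # reverse columns so variable 1 is at the top
      breaks = c(-1, -0.5, 0.5, 1),
      col = c("blue", "white", "red"),
      useRaster = TRUE,             # draw as a bitmap, which helps with many small cells
      axes = FALSE, xlab = "", ylab = "")
```

No reordering or clustering is applied, so the tiles stay in the original variable order.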
