I am trying to do k-means clustering using R, and this is what I have done so far:
tmp <- kmeans(ds, centers = 4, iter.max = 1000)
plot(ds[tmp$cluster==1,c(1,5)], col = "red", xlim = c(min(ds[,1]),
max(ds[,1])), ylim = c(min(ds[,5]), max(ds[,5])))
points(ds[tmp$cluster==2,c(1,5)], col = "blue")
points(ds[tmp$cluster==3,c(1,5)], col = "seagreen")
points(ds[tmp$cluster==4,c(1,5)], col = "orange")
points(tmp$centers[,c(1,5)], col = "black")
and I get the following graph:
I am quite new to this, so I may be way off, but this graph does not look quite right to me. The data is basically divided in zones and to be honest, I was expecting to see something along the lines of this:
The circles in this picture are just to showcase where I was expecting the clusters to be. Can anyone explain why the data is clustered like that? I did the clustering multiple times and I always end up with this result.
The dataset I am using can be found here.
Notice that Age runs from about 18 to 60, so the maximum distance between age is about 40. Now notice that the incomes range from 0 to 20000. The distance between points is heavily dominated by the income. If you wish both variables to be used in the clustering, you should scale the data before clustering. Try
tmp<-kmeans(scale(ds), centers = 4, iter.max = 1000)
This is how the k-means clustering algorithm work. Google "k-means clustering" and look at the picture results and you will see different variations: circular clusters and the type you received. If you set number of clusters k to a different number, you will get different clusters. The goal of the algorithm is to partition a data set into a desired number of non-overlapping clusters k, so that the total within-cluster variation is minimized. And this is the result you see in your plot.
I have an hclust tree with nearly 2000 samples. I have cut it to an appropriate number of clusters and would like to plot the dendrogram but ending at the height that I cut the clusters rather than all the way to every individual leaf. Every plotting guide is about coloring all the leaves by cluster or drawing a box, but nothing seems to just leave the leaves below the cut line out completely.
My full dendrogram looks like the following:
I would like to plot it as if it stops where I've drawn the abline here (for example):
This should get you started. I suggest reading the help page for "dendrogram"
Here is the example from the help page:
hc <- hclust(dist(USArrests))
dend1 <- as.dendrogram(hc)
plot(dend1)
dend2 <- cut(dend1, h = 100)
plot(dend2$upper)
plot(dend2$upper, nodePar = list(pch = c(1,7), col = 2:1))
By performing the cut on the dendrogram object (not the hclust object) you can then plot the upper part of the dendrogram. It will take a some work to replace the branch1, 2, 3, and 4 labels depending on your analysis.
Good luck.
I would like to have visualization of hierarchical clustering with shapes one inside the other. Brightness level represents level of hierarchy.
Let me show you my idea with an example:
# Clustering small proportion of iris data
clusters <- hclust(dist(iris[20:28, 3:4]), method = 'average')
# Visualizing the result as a dendogram
plot(clusters)
Now we can convert the dendrogram as below.
Is there any R package that can produce something similar?
This is only a partial answer. You can use clusplot from the cluster package to get some way in that direction. You could probably improve on this by changing the source of clusplot (type getAnywhere(clusplot.default) to get the source). But it is probably some work to get your bubbles to not overlap. Anyway, here's the plot you get from clusplot. It may also be of interest to look at the individual plots one at a time instead of showing them all together.
# use sample data
df <- iris[20:28, 3:4]
# calculate hierarchical clustering
hfit <- hclust(dist(df), method = 'average')
# plot dendogram
plot(hfit)
# use clusplot at all possible cutoffs and show on top of each other.
library(cluster)
clusplot(df, cutree(hfit, 1), lines = 0)
for (i in 2:nrow(df)){
clusplot(df, cutree(hfit, i), lines = 0, add = TRUE)
}
I've run a 2d simulation in some modelling software from which i've got an export of x,y point locations with a set of 6 attributes. I wish to recreate a figure that combines the data, like this:
The ellipses and the background are shaded according to attribute 1 (and the borders of these are of course representing the model geometry, but I don't think I can replicate that), the isolines are contours of attribute 2, and the arrow glyphs are from attributes 3 (x magnitude) and 4 (y magnitude).
The x,y points are centres of the triangulated mesh I think, and look like this:
I want to know how I can recreate a plot like this with R. To start with I have irregularly-spaced data due to it being exported from an irregular mesh. That's immediately where I get stuck with R, having only ever used it for producing box-and-whisper plots and the like.
Here's the data:
https://dl.dropbox.com/u/22417033/Ellipses_noheader.txt
Edit: fields: x, y, heat flux (x), heat flux (y), thermal conductivity, Temperature, gradT (x), gradT (y).
names(Ellipses) <- c('x','y','dfluxx','dfluxy','kxx','Temps','gradTx','gradTy')
It's quite easy to make the lower plot (making the assumption that there is a dataframe named 'edat' read in with:
edat <- read.table(file=file.choose())
with(edat, plot(V1,V2), cex=0.2)
Things get a bit more beautiful with:
with(edat, plot(V1,V2, cex=0.2, col=V5))
So I do not think your original is being faithfully represented by the data. The contour lines are NOT straight across the "conductors". I call them "conductors" because this looks somewhat like iso-potential lines in electrostatics. I'm adding some text here to serve as a search handle for others who might be searching for plotting problems in real world physics: vector-field (the arrows) , heat equations, gradient, potential lines.
You can then overlay the vector field with:
with(edat, arrows(V1,V2, V1-20*V6*V7, V2-20*V6*V8, length=0.04, col="orange") )
You could"zoom in" with xlim and ylim:
with(edat, plot(V1,V2, cex=0.3, col=V5, xlim=c(0, 10000), ylim=c(-8000, -2000) ))
with(edat, arrows(V1,V2, V1-20*V6*V7, V2-20*V6*V8, length=0.04, col="orange") )
Guessing that the contour requested if for the Temps variable. Take your pick of contourplots.
require(akima)
intflow<- with(edat, interp(x=x, y=y, z=Temps, xo=seq(min(x), max(x), length = 410),
yo=seq(min(y), max(y), length = 410), duplicate="mean", linear=FALSE) )
require(lattice)
contourplot(intflow$z)
filled.contour(intflow)
with( intflow, contour(x=x, y=y, z=z) )
The last one will mix with the other plotting examples since those were using base plotting functions. You may need to switch to points instead of plot.
There are several parts to your plot so you will probably need several tools to make the different parts.
The background and ellipses can be created with polygon (once you figure where they should be).
The contourLines function can calculate the contour lines for you which you can add with the lines function (or contour has and add argument and could probably be used to add the lines directly).
The akima package has a function interp which can estimate values on a grid given the values ungridded.
The my.symbols function along with ms.arrows, both from the TeachingDemos package, can be used to draw the vector field.
#DWin is right to say that your graph don't represent faithfully your data, so I would advice to follow his answer. However here is how to reproduce (the closest I could) your graph:
Ellipses <- read.table(file.choose())
names(Ellipses) <- c('x','y','dfluxx','dfluxy','kxx','Temps','gradTx','gradTy')
require(splancs)
require(akima)
First preparing the data:
#First the background layer (the 'kxx' layer):
# Here the regular grid on which we're gonna do the interpolation
E.grid <- with(Ellipses,
expand.grid(seq(min(x),max(x),length=200),
seq(min(y),max(y),length=200)))
names(E.grid) <- c("x","y") # Without this step, function inout throws an error
E.grid$Value <- rep(0,nrow(E.grid))
#Split the dataset according to unique values of kxx
E.k <- split(Ellipses,Ellipses$kxx)
# Find the convex hull delimiting each of those values domain
E.k.ch <- lapply(E.k,function(X){X[chull(X$x,X$y),]})
for(i in unique(Ellipses$kxx)){ # Pick the value for each coordinate in our regular grid
E.grid$Value[inout(E.grid[,1:2],E.k.ch[names(E.k.ch)==i][[1]],bound=TRUE)]<-i
}
# Then the regular grid for the second layer (Temp)
T.grid <- with(Ellipses,
interp(x,y,Temps, xo=seq(min(x),max(x),length=200),
yo=seq(min(y),max(y),length=200),
duplicate="mean", linear=FALSE))
# The regular grids for the arrow layer (gradT)
dx <- with(Ellipses,
interp(x,y,gradTx,xo=seq(min(x),max(x),length=15),
yo=seq(min(y),max(y),length=10),
duplicate="mean", linear=FALSE))
dy <- with(Ellipses,
interp(x,y,gradTy,xo=seq(min(x),max(x),length=15),
yo=seq(min(y),max(y),length=10),
duplicate="mean", linear=FALSE))
T.grid2 <- with(Ellipses,
interp(x,y,Temps, xo=seq(min(x),max(x),length=15),
yo=seq(min(y),max(y),length=10),
duplicate="mean", linear=FALSE))
gradTgrid<-expand.grid(dx$x,dx$y)
And then the plotting:
palette(grey(seq(0.5,0.9,length=5)))
par(mar=rep(0,4))
plot(E.grid$x, E.grid$y, col=E.grid$Value,
axes=F, xaxs="i", yaxs="i", pch=19)
contour(T.grid, add=TRUE, col=colorRampPalette(c("blue","red"))(15), drawlabels=FALSE)
arrows(gradTgrid[,1], gradTgrid[,2], # Here I multiply the values so you can see them
gradTgrid[,1]-dx$z*40*T.grid2$z, gradTgrid[,2]-dy$z*40*T.grid2$z,
col="yellow", length=0.05)
To understand in details how this code works, I advise you to read the following help pages: ?inout, ?chull, ?interp, ?expand.grid and ?contour.
How can I create a cluster plot in R without using clustplot?
I am trying to get to grips with some clustering (using R) and visualisation (using HTML5 Canvas).
Basically, I want to create a cluster plot but instead of plotting the data, I want to get a set of 2D points or coordinates that I can pull into canvas and do something might pretty with (but I am unsure of how to do this). I would imagine that I:
Create a similarity matrix for the entire dataset (using dist)
Cluster the similarity matrix using kmeans or something similar (using kmeans)
Plot the result using MDS or PCA - but I am unsure of how steps 2 and 3 relate (cmdscale).
I've checked out questions here, here and here (with the last one being of most use).
Did you mean something like this?
Sorry but i know nothing about HTML5 Canvas, only R... But I hope it helps...
First I cluster the data using kmeans (note that I did not cluster the distance matrix), than I compute the distance matix and plot it using cmdscale. Then I add colors to the MDS-plot that correspond to the groups identified by kmeans. Plus some nice additional graphical features.
You can access the coordinates from the object created by cmdscale.
### some sample data
require(vegan)
data(dune)
# kmeans
kclus <- kmeans(dune,centers= 4, iter.max=1000, nstart=10000)
# distance matrix
dune_dist <- dist(dune)
# Multidimensional scaling
cmd <- cmdscale(dune_dist)
# plot MDS, with colors by groups from kmeans
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n")
cols <- c("steelblue", "darkred", "darkgreen", "pink")
for(i in seq_along(groups)){
points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}
# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")
Here you can find one graph to analyze cluster results, "coordinate plot", within "clusplot" package.
It is not based on PCA. It uses function scale to have all the variables means in a range of 0 to 1, so you can compare which cluster holds the max/min average for each variable.
install.packages("devtools") ## To be able to download packages from github
library(devtools)
install_github("pablo14/clusplus")
library(clusplus)
## Create k-means model with 3 clusters
fit_mtcars=kmeans(mtcars,3)
## Call the function
plot_clus_coord(fit_mtcars, mtcars)
This post explains how to use it.