K-centers clustering using R - is the resulting plot off?

I am trying to do k-means clustering using R, and this is what I have done so far:
tmp <- kmeans(ds, centers = 4, iter.max = 1000)
plot(ds[tmp$cluster==1,c(1,5)], col = "red", xlim = c(min(ds[,1]),
max(ds[,1])), ylim = c(min(ds[,5]), max(ds[,5])))
points(ds[tmp$cluster==2,c(1,5)], col = "blue")
points(ds[tmp$cluster==3,c(1,5)], col = "seagreen")
points(ds[tmp$cluster==4,c(1,5)], col = "orange")
points(tmp$centers[,c(1,5)], col = "black")
and I get the following graph:
I am quite new to this, so I may be way off, but this graph does not look quite right to me. The data is basically divided into zones and, to be honest, I was expecting to see something along the lines of this:
The circles in this picture are just to showcase where I was expecting the clusters to be. Can anyone explain why the data is clustered like that? I did the clustering multiple times and I always end up with this result.
The dataset I am using can be found here.

Notice that Age runs from about 18 to 60, so the maximum distance between ages is about 40. Now notice that the incomes range from 0 to 20000, so the distance between points is heavily dominated by the income. If you wish both variables to be used in the clustering, you should scale the data before clustering. Try
tmp <- kmeans(scale(ds), centers = 4, iter.max = 1000)
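If you also want to redraw the plot after clustering the scaled data, a rough sketch follows (it assumes, as in the question's code, that columns 1 and 5 of ds are the two variables to display, e.g. Age and Income):
sds <- scale(ds)   # standardize each column to mean 0 and sd 1
tmp <- kmeans(sds, centers = 4, iter.max = 1000)
# plot the original (unscaled) values, colored by the new cluster assignment
plot(ds[, c(1, 5)], col = c("red", "blue", "seagreen", "orange")[tmp$cluster])
# the centers are in scaled units, so convert them back before plotting
ctr <- sweep(tmp$centers, 2, attr(sds, "scaled:scale"), "*")
ctr <- sweep(ctr, 2, attr(sds, "scaled:center"), "+")
points(ctr[, c(1, 5)], col = "black", pch = 8)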

This is how the k-means clustering algorithm works. Google "k-means clustering" and look at the image results and you will see different variations: circular clusters as well as the type you received. If you set the number of clusters k to a different value, you will get different clusters. The goal of the algorithm is to partition the data set into a desired number k of non-overlapping clusters so that the total within-cluster variation is minimized, and that is the result you see in your plot.
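If you want to see how the total within-cluster variation (the quantity k-means minimizes) behaves for different choices of k, here is a small sketch using the tot.withinss component of the kmeans result (ds is the data set from the question, scaled as suggested in the other answer):
# total within-cluster sum of squares for k = 1..10; an "elbow" in this
# curve is a common heuristic for choosing the number of clusters
wss <- sapply(1:10, function(k)
  kmeans(scale(ds), centers = k, iter.max = 1000, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster sum of squares")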

Related

How to line (cut) a dendrogram at the best K

How do I draw a line in a dendrogram that corresponds to the best K for a given criterion?
Like this:
Let's suppose that this is my dendrogram, and the best K is 4.
data("mtcars")
myDend <- as.dendrogram(hclust(dist(mtcars)))
plot(myDend)
I know that the abline function can draw lines on graphs like the one shown above. However, I don't know how I could calculate the height so that the function can be used as abline(h = myHeight).
The information you need to get the heights comes with hclust: the hclust object has a component containing the merge heights. To get the 4 clusters, you want to draw your line between the 3rd biggest and 4th biggest heights.
HC <- hclust(dist(mtcars))
myDend <- as.dendrogram(HC)
par(mar = c(7.5, 4, 2, 2))
plot(myDend)
k <- 4
n <- nrow(mtcars)
# HC$height holds the n - 1 merge heights in increasing order; a line drawn
# between the (n - k)th and (n - k + 1)th heights separates exactly k clusters
MidPoint <- (HC$height[n - k] + HC$height[n - k + 1]) / 2
abline(h = MidPoint, lty = 2)
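To double-check that the line really corresponds to 4 clusters, you can compare it against cutree and, if you like, draw the cluster boxes as well; a quick sketch (rect.hclust should line up with the dendrogram plot, since both come from the same hclust object):
table(cutree(HC, k = k))                 # sizes of the 4 clusters implied by the cut
rect.hclust(HC, k = k, border = "grey")  # boxes around those same 4 clusters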

Plotting hclust only to the cut clusters, not every leaf

I have an hclust tree with nearly 2000 samples. I have cut it into an appropriate number of clusters and would like to plot the dendrogram so that it ends at the height where I cut the clusters, rather than continuing all the way down to every individual leaf. Every plotting guide is about coloring all the leaves by cluster or drawing a box, but nothing seems to just leave the leaves below the cut line out completely.
My full dendrogram looks like the following:
I would like to plot it as if it stops where I've drawn the abline here (for example):
This should get you started. I suggest reading the help page for "dendrogram".
Here is the example from the help page:
hc <- hclust(dist(USArrests))
dend1 <- as.dendrogram(hc)
plot(dend1)
dend2 <- cut(dend1, h = 100)
plot(dend2$upper)
plot(dend2$upper, nodePar = list(pch = c(1,7), col = 2:1))
By performing the cut on the dendrogram object (not the hclust object) you can then plot the upper part of the dendrogram. It will take some work to replace the "Branch 1", "Branch 2", ... labels, depending on your analysis.
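For the label replacement, one possible approach in base R (a sketch only; it assumes the cut at h = 100 from the example above and that the branches in dend2$lower appear in the same left-to-right order as the leaves of dend2$upper):
# number of original leaves under each cut-off branch
sizes <- sapply(dend2$lower, function(d) attr(d, "members"))
# walk the upper dendrogram and rename its leaves ("Branch 1", "Branch 2", ...)
i <- 0
dend_top <- dendrapply(dend2$upper, function(node) {
  if (is.leaf(node)) {
    i <<- i + 1
    attr(node, "label") <- paste0("cluster ", i, " (n = ", sizes[i], ")")
  }
  node
})
plot(dend_top)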
Good luck.

Smoothing using kernel and loess in R

I am trying to smooth my data set using a kernel or loess smoothing method, but the results are not clear or not what I want. My questions are the following.
My x data is "conc" and my y data is "depth" (in cm, for example).
1) Kernel smooth
k <- kernel("daniell", 150)
plot(k)
K <- kernapply(conc, k)
plot(conc~depth)
lines(K, col = "red")
Here, my data is smoothed with frequency = 150. Does this mean that every data point is averaged over the neighboring 150 data points to the right and left? What does "daniell" mean? I could not find an explanation online.
2) Loess smooth
p<-qplot(depth, conc, data=total)
p1 <- p + geom_smooth(method = "loess", size = 1, level=0.95)
Here, what are the defaults of the loess smoothing function? If I want to smooth my data with frequency = 150 as in the case above (a moving average over every 150 data points), how can I modify this code?
3) To show the y-axis on a log scale, I put "log10(conc)" instead of "conc", and it worked. But I cannot change the y-axis tick labels. I tried to use "scale_y_log10(limits = c(1,1e3))" in my code to show axis tick labels like 10^0, 10^1, 10^2, ..., but it did not work.
Please answer my questions. Thanks a lot for your help.
Sum

Why am I not getting points around clusters in this kmeans implementation?

In the k-means analysis below, I am assigning a 1 or 0 to indicate whether a word is associated with a user:
cells = c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames = c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames = c("google","so","test")
x <- matrix(cells, nrow=9, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
# plot clusters
plot(x, col = km$cluster)
# plot centers
points(km$centers, col = 1:2, pch = 8)
This is the graph:
Why do I not receive multiple points around each cluster? What is this graph indicating? I would like to suggest a word to a user depending on whether another user has the same word configured.
You don't see multiple points because your data are discrete, categorical observations. K-means is really only suitable for grouping continuous observations. Your data can only appear on three points on the plot you've shown and three points don't make a nice "cloud" of data.
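To see the overplotting concretely, a quick check (plot(x) only uses the first two columns, so this counts the distinct positions that actually get drawn):
unique(x[, c("google", "so")])                                   # the distinct (x, y) positions in the plot
table(apply(x[, c("google", "so")], 1, paste, collapse = "-"))   # how many rows sit on each position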
This suggests to me that k-means is probably not appropriate for your specific problem.
Incidentally, when I run the code above, I get the plot below, which is different from the one you've shown us. Perhaps this is more like what you are expecting? The green data point belongs to (is "around") the upper-right cluster centre indicated by a black asterisk.

How to create a cluster plot in R?

How can I create a cluster plot in R without using clusplot?
I am trying to get to grips with some clustering (using R) and visualisation (using HTML5 Canvas).
Basically, I want to create a cluster plot, but instead of plotting the data I want to get a set of 2D points or coordinates that I can pull into canvas and do something pretty with (though I am unsure of how to do this). I would imagine that I:
1. Create a similarity matrix for the entire dataset (using dist)
2. Cluster the similarity matrix using kmeans or something similar (using kmeans)
3. Plot the result using MDS or PCA - but I am unsure of how steps 2 and 3 relate (cmdscale).
I've checked out questions here, here and here (with the last one being of most use).
Did you mean something like this?
Sorry, but I know nothing about HTML5 Canvas, only R... But I hope it helps...
First I cluster the data using kmeans (note that I did not cluster the distance matrix), then I compute the distance matrix and plot it using cmdscale. Then I add colors to the MDS plot that correspond to the groups identified by kmeans, plus some nice additional graphical features.
You can access the coordinates from the object created by cmdscale.
### some sample data
require(vegan)
data(dune)
# kmeans
kclus <- kmeans(dune,centers= 4, iter.max=1000, nstart=10000)
# distance matrix
dune_dist <- dist(dune)
# Multidimensional scaling
cmd <- cmdscale(dune_dist)
# plot MDS, with colors by groups from kmeans
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n")
cols <- c("steelblue", "darkred", "darkgreen", "pink")
for(i in seq_along(groups)){
points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}
# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")
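Since the goal is to pull the 2D coordinates into HTML5 Canvas, you can simply export what cmdscale returned; a small sketch (the file name is just an example):
# cmd has one row per observation and two MDS coordinates
xy <- data.frame(x = cmd[, 1], y = cmd[, 2], cluster = kclus$cluster)
write.csv(xy, "cluster_coords.csv", row.names = FALSE)   # or convert to JSON for the canvas code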
Here you can find another kind of graph for analyzing cluster results, the "coordinate plot", from the "clusplus" package.
It is not based on PCA. It uses the scale function so that all the variable means fall in a range of 0 to 1, so you can compare which cluster holds the max/min average for each variable.
install.packages("devtools") ## To be able to download packages from github
library(devtools)
install_github("pablo14/clusplus")
library(clusplus)
## Create k-means model with 3 clusters
fit_mtcars <- kmeans(mtcars, 3)
## Call the function
plot_clus_coord(fit_mtcars, mtcars)
This post explains how to use it.
