clustered scatterplot in R - r

I'm new to R. I searched, but I could only find outdated information.
I've done a simple hierarchical clustering (the code below uses complete linkage).
d<-dist(scale(DATA),method="euclidean",diag=TRUE,upper=TRUE)
hls<-hclust(d,method="complete")
How can I draw a scatterplot that uses a different color for each cluster?
Exactly like this example

I created some sample data to work with. If your data looks different, please provide some sample data as part of your question.
To create a scatter plot colored by group, first create your groups using the cutree function. You can specify an integer value to indicate how many groups you want to create.
Next, use your favorite graphing package (e.g. ggplot2) to create the scatter plot.
# Sample data
rData <- data.frame(x=c(1,1,3,4), y=c(1,2,5,4))
print(rData)
# Cluster
d <- dist(scale(rData), method="euclidean", diag=TRUE, upper=TRUE)
hls <- hclust(d, method="complete")
# Create groups
cluster <- cutree(hls, 2)
# Create scatter plot
library(ggplot2)
ggData <- cbind(rData, cluster)
ggData$cluster <- as.factor(ggData$cluster)
print(ggData)
ggplot(ggData, aes(x=x, y=y, color=cluster)) + geom_point(size=5)
I would recommend exploring http://www.cookbook-r.com/Graphs/ to learn more about ggplot.
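If you would rather avoid an extra package, the same idea also works with base graphics. The following is a minimal sketch, assuming the four-point sample data from the answer above; the palette and the name cluster_cols are illustrative choices, not part of the original answer.

```r
# Sample data (same four points as above)
rData <- data.frame(x = c(1, 1, 3, 4), y = c(1, 2, 5, 4))

# Hierarchical clustering on the scaled data (complete linkage)
hls <- hclust(dist(scale(rData)), method = "complete")

# Cut the tree into 2 groups; the result is an integer vector of cluster ids
cluster <- cutree(hls, k = 2)

# Illustrative palette: one color per cluster id
cluster_cols <- c("steelblue", "darkred")

# Index the palette by cluster id so each point gets its cluster's color
plot(rData$x, rData$y, col = cluster_cols[cluster], pch = 16, cex = 2,
     xlab = "x", ylab = "y")
legend("topleft", legend = sort(unique(cluster)),
       col = cluster_cols, pch = 16, title = "cluster")
```

Indexing a color vector by the cluster id is the base-graphics counterpart of mapping color=cluster in ggplot2.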

Related

How to add the original data to a contour plot without external libraries in R?

I have a standard unfilled contour plot in R. It has two regions and was generated using the kde function. It appears to be normalised to between 0 and 1. I want to plot the original data over it, but R just seems to plot the data on a separate graph each time. I have tried using lines() and points(). So my two questions are: 1) how do you un-normalise a contour plot (did kde normalise the output?) and 2) how do you plot the original data over a contour plot?
Skeleton code:
data.kde <- kde(data)
plot(data)
contour(data.kde$estimate, add=TRUE)
I am not sure if the add=TRUE statement is working, as the data is on different scales as my contour plot has come out normalised to between 0 and 1. If I normalise my original data it does not quite match where it should on the contour - the two data centres are slightly off from the contour centres.
Suppose your data is like this:
library(ks)
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
data <- cbind(x, y)
Then you can do:
KDE <- kde(data)
plot(KDE, drawpoints = TRUE)
Or, if you want to use contour:
contour(x = KDE$eval.points[[1]], y = KDE$eval.points[[2]], z = KDE$estimate)
points(KDE$x[,1], KDE$x[,2])
Created on 2022-02-03 by the reprex package (v2.0.1)
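The contours in the question land on a 0 to 1 scale because contour(), when given only a matrix z, defaults x and y to evenly spaced values from 0 to 1. Passing the evaluation grid explicitly keeps the contours in data coordinates. The sketch below illustrates this with MASS::kde2d instead of ks::kde; that substitution is an assumption made so the example only needs packages that ship with a standard R installation, and the data are simulated.

```r
library(MASS)  # recommended package bundled with standard R installations

set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)   # simulated data, deliberately not on a 0-1 scale
y <- rnorm(100, mean = 5,  sd = 2)

# kde2d returns a list with grid coordinates x, y and the density matrix z
dens <- kde2d(x, y, n = 50)

# Draw the raw data first, then overlay the contours on the same axes.
# Supplying x and y keeps the contours in data coordinates.
plot(x, y, pch = 16, col = "grey40")
contour(x = dens$x, y = dens$y, z = dens$z, add = TRUE)
```

By default kde2d evaluates the density over the range of the data, so the grid in dens$x and dens$y lines up with the plotted points without any rescaling.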

ggplot2 2d Density Weights

I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_point and geom_density2d. I want geom_density2d to be weighted based on the organisation's size (OrgSize variable). However, when I add OrgSize as a weighting variable, nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering whether I'm using density2d inappropriately. Should I instead be using contour and supplying OrgSize as the 'height'? If so, then why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight aesthetic, but then not passing it to MASS::kde2d, since that function has no weights argument. As a consequence, you will need to use a different 2d-density method.
(I realize my answer does not address why the help page says that geom_density2d "understands" the weight argument, but when I have tried to calculate weighted 2D-KDEs, I have needed to use packages other than MASS. Maybe this is a TODO that @hadley put in the help page that then got overlooked?)
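To see what a weighted estimate would look like, a weighted 2D kernel density can be computed by hand in base R and drawn with contour(). This is only a sketch: weighted_kde2d is a hypothetical helper written for this example (it is not a ggplot2 or MASS function), it uses a Gaussian product kernel, and the bandwidths from bw.nrd are an assumed default rather than what geom_density2d would use.

```r
# Data from the question
Distance <- c(1.5, 1.6, 1.8, 5.8, 4.2, 4.3, 6.5, 4.9, 7.4, 7.2)
Rate     <- c(0.1, 0.3, 0.2, 0.4, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
OrgSize  <- c(500, 1000, 200, 300, 1500, 800, 50, 1000, 75, 800)

# Hypothetical helper: weighted 2D Gaussian kernel density on a regular grid
weighted_kde2d <- function(x, y, w = rep(1, length(x)), n = 50,
                           h = c(bw.nrd(x), bw.nrd(y))) {
  w  <- w / sum(w)                              # normalise weights to sum to 1
  gx <- seq(min(x), max(x), length.out = n)     # grid over the data range
  gy <- seq(min(y), max(y), length.out = n)
  kx <- dnorm(outer(gx, x, "-"), sd = h[1])     # n x length(x) kernel values
  ky <- dnorm(outer(gy, y, "-"), sd = h[2])
  # z[i, j] = sum over points k of w[k] * kx[i, k] * ky[j, k]
  z  <- kx %*% (w * t(ky))
  list(x = gx, y = gy, z = z)
}

unweighted <- weighted_kde2d(Distance, Rate)
weighted   <- weighted_kde2d(Distance, Rate, w = OrgSize)

# Overlay both estimates on the raw points to compare them
plot(Distance, Rate, pch = 16)
contour(unweighted, add = TRUE, col = "grey50", lty = "dashed")
contour(weighted,   add = TRUE, col = "darkred")
```

With equal weights the helper reduces to an ordinary product-kernel estimate, so any difference between the two sets of contours comes from OrgSize alone.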

cluster presentation dendrogram alternative in r

I know dendrograms are quite popular. However, with a large number of observations and classes they are hard to follow, and I sometimes feel there should be a better way to present the same thing. I have an idea but do not know how to implement it.
Consider the following dendrogram.
> data(mtcars)
> plot(hclust(dist(mtcars)))
I would like to plot it like a scatter plot, in which the distance between two points is drawn as a line, separate clusters (at an assumed threshold) are colored, and circle size is determined by the value of some variable.
You are describing a fairly typical way of going about cluster analysis:
Use a clustering algorithm (in this case hierarchical clustering)
Decide on the number of clusters
Project the data onto a two-dimensional plane using some form of principal component analysis (the code below uses classical multidimensional scaling via cmdscale)
The code:
hc <- hclust(dist(mtcars))
cluster <- cutree(hc, k=3)
xy <- data.frame(cmdscale(dist(mtcars)), factor(cluster))
names(xy) <- c("x", "y", "cluster")
xy$model <- rownames(xy)
library(ggplot2)
ggplot(xy, aes(x, y)) + geom_point(aes(colour=cluster), size=3)
What happens next is that you get a skilled statistician to help explain what the x and y axes mean. This usually involves projecting the data to the axes and extracting the factor loadings.
The plot:
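For Euclidean distances, the classical multidimensional scaling done by cmdscale gives the same coordinates as a principal component analysis of the centered data, up to the sign of each axis. The sketch below checks this with prcomp and prints the loadings that help interpret the axes; the object names pca and mds are chosen for illustration.

```r
# PCA on the centered, unscaled data, matching dist(mtcars)
pca <- prcomp(mtcars, center = TRUE, scale. = FALSE)

# Classical MDS on the Euclidean distance matrix, as in the answer above
mds <- cmdscale(dist(mtcars), k = 2)

# The two sets of coordinates agree up to sign, so both values are numerically 1
abs(cor(pca$x[, 1], mds[, 1]))
abs(cor(pca$x[, 2], mds[, 2]))

# Loadings: how much each original variable contributes to the first two axes
round(pca$rotation[, 1:2], 3)
```

Because mtcars is not rescaled here, variables with large numeric ranges (such as disp and hp) dominate the axes; whether to standardise first is a modelling decision.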

How to create a cluster plot in R?

How can I create a cluster plot in R without using clustplot?
I am trying to get to grips with some clustering (using R) and visualisation (using HTML5 Canvas).
Basically, I want to create a cluster plot, but instead of plotting the data, I want to get a set of 2D points or coordinates that I can pull into canvas and do something mighty pretty with (but I am unsure of how to do this). I would imagine that I:
Create a similarity matrix for the entire dataset (using dist)
Cluster the similarity matrix using kmeans or something similar (using kmeans)
Plot the result using MDS or PCA - but I am unsure of how steps 2 and 3 relate (cmdscale).
I've checked out questions here, here and here (with the last one being of most use).
Did you mean something like this?
Sorry, but I know nothing about HTML5 Canvas, only R... but I hope it helps.
First I cluster the data using kmeans (note that I did not cluster the distance matrix), then I compute the distance matrix and plot it using cmdscale. Then I add colors to the MDS plot that correspond to the groups identified by kmeans, plus some nice additional graphical features.
You can access the coordinates from the object created by cmdscale.
### some sample data
require(vegan)
data(dune)
# kmeans
kclus <- kmeans(dune,centers= 4, iter.max=1000, nstart=10000)
# distance matrix
dune_dist <- dist(dune)
# Multidimensional scaling
cmd <- cmdscale(dune_dist)
# plot MDS, with colors by groups from kmeans
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n")
cols <- c("steelblue", "darkred", "darkgreen", "pink")
for(i in seq_along(groups)){
points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}
# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")
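To hand the points to something like an HTML5 canvas, the coordinates and cluster labels can be collected in a data frame and written to a file. A minimal sketch, assuming the built-in USArrests data in place of dune so that no extra package is needed; the file name and column names are illustrative, and CSV is just one possible exchange format.

```r
set.seed(42)                                    # kmeans uses random starts

dat   <- scale(USArrests)                       # built-in data, standardised
kclus <- kmeans(dat, centers = 4, nstart = 25)  # cluster the data itself
cmd   <- cmdscale(dist(dat), k = 2)             # 2D coordinates from MDS

# One row per observation: label, x, y and the kmeans cluster id
coords <- data.frame(label   = rownames(dat),
                     x       = cmd[, 1],
                     y       = cmd[, 2],
                     cluster = kclus$cluster)

# Write to a temporary CSV file (illustrative path)
out_file <- file.path(tempdir(), "cluster_coords.csv")
write.csv(coords, out_file, row.names = FALSE)
head(coords)
```

This also shows how the steps relate: kmeans assigns the groups, cmdscale only supplies positions for drawing, and the two are joined row by row.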
Here you can find one graph for analyzing cluster results, the "coordinate plot", in the "clusplus" package.
It is not based on PCA. It uses a scale function to bring all the variable means into a range of 0 to 1, so you can compare which cluster holds the max/min average for each variable.
install.packages("devtools") ## To be able to download packages from github
library(devtools)
install_github("pablo14/clusplus")
library(clusplus)
## Create k-means model with 3 clusters
fit_mtcars=kmeans(mtcars,3)
## Call the function
plot_clus_coord(fit_mtcars, mtcars)
This post explains how to use it.
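The idea behind such a coordinate plot can be approximated in base R. The sketch below is an illustration of the concept only, not the clusplus implementation: it assumes min-max scaling of the k-means cluster centers, and rescale01 and centers01 are hypothetical names.

```r
set.seed(1)                          # kmeans uses random starts
fit <- kmeans(mtcars, centers = 3)

# Rescale each variable's cluster means to [0, 1] (min-max scaling)
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
centers01 <- apply(fit$centers, 2, rescale01)   # 3 clusters x 11 variables

# One line per cluster, one position on the x axis per variable
matplot(t(centers01), type = "b", pch = 16, lty = 1, xaxt = "n",
        xlab = "variable", ylab = "rescaled cluster mean")
axis(1, at = seq_len(ncol(centers01)), labels = colnames(centers01))
```

For each variable, the cluster at 1 has the highest mean and the cluster at 0 the lowest, which is the comparison described above.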

Plotting predefined density functions using ggplot and R

I have three data sets of different lengths and I would like to plot density functions of all three on the same plot. This is straightforward with base graphics:
n <- c(rnorm(10000), rnorm(10000))
a <- c(rnorm(10001), rnorm(10001, 0, 2))
p <- c(rnorm(10002), rnorm(10002, 2, .5))
plot(density(n))
lines(density(a))
lines(density(p))
Which gives me something like this:
Image: http://www.cerebralmastication.com/wp-content/uploads/2009/10/density.png
But I really want to do this with GGPLOT2 because I want to add other features that are only available with GGPLOT2. It seems that GGPLOT really wants to take my empirical data and calculate the density for me. And it gives me a bunch of lip because my data sets are of different lengths. So how do I get these three densities to plot in GGPLOT2?
The secret to happiness in ggplot2 is to put everything in the "long" (or what I guess matrix oriented people would call "sparse") format:
library(ggplot2)
df <- rbind(data.frame(x="n",value=n),
            data.frame(x="a",value=a),
            data.frame(x="p",value=p))
qplot(value, colour=x, data=df, geom="density")
If you don't want colors:
qplot(value, group=x, data=df, geom="density")
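qplot() has been deprecated in recent ggplot2 releases (3.4.0 onward), so the same plot may be better written with ggplot() directly. A sketch assuming the simulated vectors from the question; the column name series and the object name dens_plot are illustrative choices.

```r
library(ggplot2)

set.seed(1)
n <- c(rnorm(10000), rnorm(10000))
a <- c(rnorm(10001), rnorm(10001, 0, 2))
p <- c(rnorm(10002), rnorm(10002, 2, .5))

# Long format: one row per observation, with a column naming its data set
df <- rbind(data.frame(series = "n", value = n),
            data.frame(series = "a", value = a),
            data.frame(series = "p", value = p))

# One density curve per series; unequal lengths are not a problem in long format
dens_plot <- ggplot(df, aes(x = value, colour = series)) +
  geom_density()
print(dens_plot)
```

Replacing colour = series with group = series gives the uncolored variant.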

Resources