Cluster analysis representation in R - r

I'm doing a cluster analysis on a large spatial dataset where x and y are the spatial coordinates. I'm using hclust and dynamicTreeCut, and for the representation I'm showing a scatterplot using ggplot.
But I can't visualize the clusters correctly, because the cut shows up only along rows or only along columns.
This is a toy sample of what I want. With my real data, the clusters come out only as horizontal bands (so I think it clusters, or visualizes, on y only).
library(dynamicTreeCut)  # provides cutreeDynamic
library(ggplot2)
# x and y are coordinates of points in a map
x <- sample(1:1000, 80)
y <- sample(1:1000, 80)
df <- data.frame(x, y)
d <- dist(df, method = "euclidean")
hc <- hclust(d, method = "complete")
maxCoreScatter <- 0.81
minGap <- (1 - maxCoreScatter) * 3/4
groups <- cutreeDynamic(hc, minClusterSize = 1, method = "hybrid", distM = as.matrix(d),
                        deepSplit = 4, maxCoreScatter = maxCoreScatter, minGap = minGap)
df$group <- groups
p <- ggplot(df, aes(x, y, colour = factor(group))) + geom_point() + theme(aspect.ratio = 1)
p
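One way to check whether the banding comes from the clustering itself or from the plot is to cut the same tree with base R and look at how far each cluster extends along each axis. The sketch below is only a diagnostic, not a fix: it swaps cutreeDynamic for cutree with an arbitrary k = 4, and the `spread` table is a name introduced here for illustration. If one coordinate spans a much larger range than the other in the real data, it will dominate the Euclidean distance, which could produce bands.

```r
# Toy data with the same shape as the example above
set.seed(42)
pts <- data.frame(x = sample(1:1000, 80), y = sample(1:1000, 80))

# Complete-linkage clustering on the 2-D Euclidean distances
hc <- hclust(dist(pts, method = "euclidean"), method = "complete")

# Fixed number of clusters (k = 4 is an arbitrary choice for illustration)
pts$group <- factor(cutree(hc, k = 4))

# Extent of each cluster along each axis: comparable extents in x and y
# suggest compact 2-D clusters, very unequal extents suggest bands
spread <- aggregate(cbind(x, y) ~ group, data = pts,
                    FUN = function(v) diff(range(v)))
print(spread)
```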

Related

In ggplot2 is there a relatively simple way of using different geoms for different groups in the data?

I have a set of data with multiple groups. I'd like to plot them on the same graph but with, say, a smooth line for one group and the data points for the other. Or with smooth lines for both, but data points for only one of them. An example:
library(reshape)
library(ggplot2)
set.seed(123)
x <- 1:1000
y <- 5 + rnorm(1000)
z <- 5 + 0.005*x + rnorm(1000)
df <- as.data.frame(cbind(x,y,z))
df <- melt(df,id=c("x"))
ggplot(df,aes(x=x,y=value,color=variable)) +
geom_point() + #here I want only the y variable graphed
geom_smooth() #here I want only the z variable graphed
They are both graphed against the x variable, and are on the same scale. Is there a relatively easy way to accomplish this?
Pass a filtered subset of the data to the data argument of each geom, so each layer draws only the group you want:
library(ggplot2)
library(reshape)
set.seed(123)
x <- 1:1000
y <- 5 + rnorm(1000)
z <- 5 + 0.005*x + rnorm(1000)
df <- as.data.frame(cbind(x,y,z))
df <- reshape::melt(df,id=c("x"))
df
ggplot(df,aes(x=x,y=value,color=variable)) +
geom_point(data=df[df$variable=="y",]) + #here I want only the y variable graphed
geom_smooth(data=df[df$variable=="z",]) #here I want only the z variable graphed
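A variation on the same idea, as a sketch: the data argument of a layer also accepts a function, which ggplot2 calls with the plot's data, so the filter does not have to repeat the data frame name. The helper `only()` and the data frame `long` are hypothetical names introduced here, and the long format is built with base R rather than reshape::melt.

```r
library(ggplot2)

set.seed(123)
x <- 1:1000
# Long format built with base R instead of reshape::melt
long <- data.frame(
  x        = rep(x, 2),
  variable = factor(rep(c("y", "z"), each = length(x))),
  value    = c(5 + rnorm(1000), 5 + 0.005 * x + rnorm(1000))
)

# Hypothetical helper: returns a function that keeps one level of `variable`
only <- function(level) function(d) d[d$variable == level, ]

p <- ggplot(long, aes(x = x, y = value, colour = variable)) +
  geom_point(data = only("y")) +   # points for the y series only
  geom_smooth(data = only("z"))    # smooth for the z series only
p
```

Because both layers still inherit the colour mapping from ggplot(), the legend keeps one entry per series.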

ggplot with points and plots of certain columns

Let's say I have a data.frame of three columns:
x <- seq(1,10)
y <- 0.1*x^2
z <- y+rnorm(10,0,10)
d <- data.frame(x,y,z)
I now want a ggplot that plots the points (x,z) and somewhat smooth lines going through (x,y).
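How can I achieve that?
Map x once in ggplot() and give each layer its own y aesthetic, so the points use z and the smooth uses y:
`%>%` <- magrittr::`%>%`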
d %>%
ggplot2::ggplot(ggplot2::aes(x=x)) +
ggplot2::geom_point(ggplot2::aes(y=z)) +
ggplot2::geom_smooth(ggplot2::aes(y=y))

Hexbin in R ggplot - hexagons get bigger if data is too sparse

I'm generating a series of hexbin plots for use in an animated GIF, and there are occasional frames that have a low density of data. The plots seem to create giant, misshapen hexagons.
Here is an example that works as expected:
library(ggplot2)
set.seed(23)
x <- rnorm(10000)
y <- rnorm(10000)
temp <- data.frame(x, y)
ggplot(temp) + stat_binhex(aes(x=x,y=y), bins=30) + scale_fill_gradientn(colours=c("white","blue"))
However, limiting it to 3 data points gives abnormal bins:
set.seed(23)
x2 <- rnorm(3)
y2 <- rnorm(3)
temp2 <- data.frame(x2, y2)
ggplot(temp2) + stat_binhex(aes(x=x2,y=y2), bins=30) + scale_fill_gradientn(colours=c("white","blue"))
I'd like to keep the same hexagon sizes (bins=30) and just have it color the 3 hexagons that contain data.
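A possible approach, sketched under assumptions: as far as I can tell, bins=30 derives the hexagon size from the range of the data in each plot, so three points get stretched bins. Setting binwidth explicitly and fixing the plot window should keep the hexagons the same size in every frame. The window `lims`, the width `bw`, and the helper `hex_frame()` are illustrative choices, not taken from the question.

```r
library(ggplot2)  # stat_binhex also needs the hexbin package installed

set.seed(23)
temp  <- data.frame(x = rnorm(10000), y = rnorm(10000))
temp2 <- data.frame(x = rnorm(3), y = rnorm(3))

# Assumed fixed plotting window shared by all frames of the animation
lims <- c(-4, 4)
# One bin width per axis, chosen so that about 30 bins span the window
bw <- rep(diff(lims) / 30, 2)

# Hypothetical helper: same bin width and same window for every frame
hex_frame <- function(d) {
  ggplot(d, aes(x = x, y = y)) +
    stat_binhex(binwidth = bw) +
    scale_fill_gradientn(colours = c("white", "blue")) +
    coord_cartesian(xlim = lims, ylim = lims)
}

p_dense  <- hex_frame(temp)   # 10000 points
p_sparse <- hex_frame(temp2)  # 3 points, same hexagon size
```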

Is there a way to create a "star" plot using ggplot?

I'm trying to (partially) reproduce the cluster plot available through s.class(...) in package ade4 using ggplot, but this question is actually much more general.
NB: This question refers to "star plots", but really only discusses spider plots.
df <- mtcars[,c(1,3,4,5,6,7)]
pca <-prcomp(df, scale.=T, retx=T)
scores <-data.frame(pca$x)
library(ade4)
km <- kmeans(df,centers=3)
plot.df <- cbind(scores$PC1, scores$PC2)
s.class(plot.df, factor(km$cluster))
The essential feature I'm looking for is the "stars", i.e. a set of lines radiating from a common point (here, the cluster centroids) to a number of other points (here, the points in the cluster).
Is there a way to do that using the ggplot package? If not directly through ggplot, does anyone know of an add-on that works? For example, there are several variations of stat_ellipse(...) that are not part of the ggplot package (here, and here).
This answer is based on @agstudy's response and the suggestions made in @Henrik's comment. Posting because it's shorter and more directly applicable to the question.
Bottom line is this: star plots are readily made with ggplot using geom_segment(...). Using df, pca, scores, and km from the question:
# build ggplot dataframe with points (x,y) and corresponding groups (cluster)
gg <- data.frame(cluster=factor(km$cluster), x=scores$PC1, y=scores$PC2)
# calculate group centroid locations
centroids <- aggregate(cbind(x,y)~cluster,data=gg,mean)
# merge centroid locations into ggplot dataframe
gg <- merge(gg,centroids,by="cluster",suffixes=c("",".centroid"))
# generate star plot...
ggplot(gg) +
geom_point(aes(x=x,y=y,color=cluster), size=3) +
geom_point(data=centroids, aes(x=x, y=y, color=cluster), size=4) +
geom_segment(aes(x=x.centroid, y=y.centroid, xend=x, yend=y, color=cluster))
Result is identical to that obtained with s.class(...).
The difficulty here is creating the data, not the plot itself. You should go through the code of the package and extract what is useful for you. This should be a good start:
dfxy <- plot.df
df <- data.frame(dfxy)
x <- df[, 1]
y <- df[, 2]
fac <- factor(km$cluster)
f1 <- function(cl) {
n <- length(cl)
cl <- as.factor(cl)
x <- matrix(0, n, length(levels(cl)))
x[(1:n) + n * (unclass(cl) - 1)] <- 1
dimnames(x) <- list(names(cl), levels(cl))
data.frame(x)
}
wt = rep(1, length(fac))
dfdistri <- f1(fac) * wt
w1 <- unlist(lapply(dfdistri, sum))
dfdistri <- t(t(dfdistri)/w1)
## create a data.frame
cstar=2
ll <- lapply(seq_len(ncol(dfdistri)),function(i){
z1 <- dfdistri[,i]
z <- z1[z1>0]
x <- x[z1>0]
y <- y[z1>0]
z <- z/sum(z)
x1 <- sum(x * z)
y1 <- sum(y * z)
hx <- cstar * (x - x1)
hy <- cstar * (y - y1)
dat <- data.frame(x=x1, y=y1, xend=x1 + hx, yend=y1 + hy,center=factor(i))
})
dat <- do.call(rbind,ll)
library(ggplot2)
ggplot(dat,aes(x=x,y=y))+
geom_point(aes(shape=center)) +
geom_segment(aes(yend=yend,xend=xend,color=center,group=center))

RGL surface plot from data frame

I've created a nice plot using scatter3d() and Rcmdr. That plot contains two nice surface smooths. Now I'd like to add to this plot one more surface, the truth (i.e. the surface defined by the function generating my observations minus the noise component).
Here is my code so far:
library(car)
set.seed(1)
n <- 200 # number of observations (x,y,z) to be generated
sd <- 0.3 # standard deviation for error term
x <- runif(n) # generate x component
y <- runif(n) # generate y component
r <- sqrt(x^2+y^2) # used to compute z values
z_t <- sin(x^2+3*y^2)/(0.1+r^2) + (x^2+5*y^2)*exp(1-r^2)/2 # calculate values of true regression function
z <- z_t + rnorm(n, sd = sd) # overlay normally distributed 'noise'
dm <- data.frame(x=x, y=y, z=z) # data frame containing (x,y,z) observations
dm_t <- data.frame(x=x,y=y, z=z_t) # data frame containing (x,y) observations and the corresponding value of the *true* regression function
# Create 3D scatterplot of:
# - Observations (this includes 'noise')
# - Surface given by Additive Model fit
# - Surface given by bivariate smoother fit
scatter3d(dm$x, dm$y, dm$z, fit=c("smooth","additive"), bg="white",
axis.scales=TRUE, grid=TRUE, ellipsoid=FALSE, xlab="x", ylab="z", zlab="y")
The solution given in another thread is to then define a function:
my_surface <- function(f, n=10, ...) {
ranges <- rgl:::.getRanges()
x <- seq(ranges$xlim[1], ranges$xlim[2], length=n)
y <- seq(ranges$ylim[1], ranges$ylim[2], length=n)
z <- outer(x,y,f)
surface3d(x, y, z, ...)
}
f <- function(x, y)
sin(x^2+3*y^2)/(0.1+r^2) + (x^2+5*y^2)*exp(1-r^2)/2
my_surface(f, alpha=0.2)
This however yields an error, saying (translated from German since this is my system language, I apologize):
Error in outer(x, y, f) :
Dimension [Product 100] does not match the length of the object [200]
I then tried an alternative approach:
x <- seq(0,1,length=20)
y <- x
z <- outer(x,y,f)
surface3d(x,y,z)
This does add a surface to my plot but it doesn't look right at all (i.e. the observations are not even close to it). Here's what the supposed true surface looks like (this is obviously wrong):
Thanks!
I think the problem may in fact be scaling. Here I created a couple of points that sit on the plane z = x+y. Then I proceeded to try to plot that plane using my method above:
library(car)
n <- 50
x <- runif(n)
y <- runif(n)
z <- x+y
scatter3d(x,y,z, surface = FALSE)
f <- function(x,y)
x + y
x_grid <- seq(0,1, length=20)
y_grid <- x_grid
z_grid <- outer(x_grid, y_grid, f)
surface3d(x_grid, y_grid, z_grid)
This gives me the following plot:
Maybe one of you can help me out with this?
The scatter3d function in car rescales data before plotting it, which makes it incompatible with essentially all rgl plotting functions, including surface3d.
You can get a plot something like what you want by using all rgl functions, e.g. plot3d(x, y, z) in place of scatter3d, but of course it will have rgl-style axes rather than car-style axes.
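A minimal all-rgl sketch of that suggestion, under a few assumptions. Separately from the rescaling issue, the f defined in the question reads r from the global environment, where it is a vector of length 200; that appears to be what triggers the outer() dimension error, and in the 20 x 20 attempt it is silently recycled. Here the radius is computed from the function's own arguments instead. The grid size, colour, and alpha are arbitrary, and the drawing is guarded so the script still runs without rgl.

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- runif(n)

# True regression function; the squared radius is computed from the
# arguments rather than taken from a global vector r
f <- function(x, y) {
  r2 <- x^2 + y^2
  sin(x^2 + 3 * y^2) / (0.1 + r2) + (x^2 + 5 * y^2) * exp(1 - r2) / 2
}
z <- f(x, y) + rnorm(n, sd = 0.3)  # noisy observations

# Evaluate the true surface on a regular grid over the unit square
x_grid <- seq(0, 1, length.out = 20)
y_grid <- x_grid
z_grid <- outer(x_grid, y_grid, f)

# Draw points and surface in the same unscaled rgl coordinates
if (requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(x, y, z)
  rgl::surface3d(x_grid, y_grid, z_grid, alpha = 0.2, color = "blue")
}
```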
