I'm trying to (partially) reproduce the cluster plot available through s.class(...) in package ade4 using ggplot, but this question is actually much more general.
NB: This question refers to "star plots", but really only discusses spider plots.
df <- mtcars[,c(1,3,4,5,6,7)]
pca <-prcomp(df, scale.=T, retx=T)
scores <-data.frame(pca$x)
library(ade4)
km <- kmeans(df,centers=3)
plot.df <- cbind(scores$PC1, scores$PC2)
s.class(plot.df, factor(km$cluster))
The essential feature I'm looking for is the "stars", i.e. a set of lines radiating from a common point (here, the cluster centroids) to a number of other points (here, the points in the cluster).
Is there a way to do that using the ggplot package? If not directly through ggplot, does anyone know of an add-in that works? For example, there are several variations on stat_ellipse(...), which is not part of the ggplot package (here, and here).
This answer is based on @agstudy's response and the suggestions made in @Henrik's comment. I'm posting it because it's shorter and more directly applicable to the question.
Bottom line is this: star plots are readily made with ggplot using geom_segment(...). Using df, pca, scores, and km from the question:
# build ggplot dataframe with points (x,y) and corresponding groups (cluster)
gg <- data.frame(cluster=factor(km$cluster), x=scores$PC1, y=scores$PC2)
# calculate group centroid locations
centroids <- aggregate(cbind(x,y)~cluster,data=gg,mean)
# merge centroid locations into ggplot dataframe
gg <- merge(gg,centroids,by="cluster",suffixes=c("",".centroid"))
# generate star plot...
ggplot(gg) +
geom_point(aes(x=x,y=y,color=cluster), size=3) +
geom_point(data=centroids, aes(x=x, y=y, color=cluster), size=4) +
geom_segment(aes(x=x.centroid, y=y.centroid, xend=x, yend=y, color=cluster))
Result is identical to that obtained with s.class(...).
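For reuse, the centroid-and-merge steps can be wrapped in a small helper. This is only a sketch: star_data() is a hypothetical name (not part of ggplot2 or ade4), and set.seed() is added as an assumption so that kmeans() gives the same clusters on every run.

```r
library(ggplot2)

# Hypothetical helper: attach each group's centroid to every point in that group.
star_data <- function(x, y, group) {
  pts <- data.frame(group = factor(group), x = x, y = y)
  cen <- aggregate(cbind(x, y) ~ group, data = pts, FUN = mean)
  merge(pts, cen, by = "group", suffixes = c("", ".centroid"))
}

set.seed(1)  # assumption: fixed seed so the clusters are reproducible
df     <- mtcars[, c(1, 3, 4, 5, 6, 7)]
scores <- data.frame(prcomp(df, scale. = TRUE)$x)
km     <- kmeans(df, centers = 3)

gg <- star_data(scores$PC1, scores$PC2, km$cluster)

# Segments from each centroid to its points, then the points on top
p <- ggplot(gg, aes(colour = group)) +
  geom_segment(aes(x = x.centroid, y = y.centroid, xend = x, yend = y)) +
  geom_point(aes(x = x, y = y), size = 3)
print(p)
```

Keeping the data preparation in one function means the same helper should work for any pair of coordinates and any grouping vector, not just PCA scores.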
The difficulty here is creating the data, not the plot itself. You should go through the package's code and extract what is useful for you. This should be a good start:
dfxy <- plot.df
df <- data.frame(dfxy)
x <- df[, 1]
y <- df[, 2]
fac <- factor(km$cluster)
f1 <- function(cl) {
n <- length(cl)
cl <- as.factor(cl)
x <- matrix(0, n, length(levels(cl)))
x[(1:n) + n * (unclass(cl) - 1)] <- 1
dimnames(x) <- list(names(cl), levels(cl))
data.frame(x)
}
wt = rep(1, length(fac))
dfdistri <- f1(fac) * wt
w1 <- unlist(lapply(dfdistri, sum))
dfdistri <- t(t(dfdistri)/w1)
## create a data.frame
cstar=2
ll <- lapply(seq_len(ncol(dfdistri)),function(i){
z1 <- dfdistri[,i]
z <- z1[z1>0]
x <- x[z1>0]
y <- y[z1>0]
z <- z/sum(z)
x1 <- sum(x * z)
y1 <- sum(y * z)
hx <- cstar * (x - x1)
hy <- cstar * (y - y1)
dat <- data.frame(x=x1, y=y1, xend=x1 + hx, yend=y1 + hy,center=factor(i))
})
dat <- do.call(rbind,ll)
library(ggplot2)
ggplot(dat,aes(x=x,y=y))+
geom_point(aes(shape=center)) +
geom_segment(aes(yend=yend,xend=xend,color=center,group=center))
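With unit weights, the indicator-matrix arithmetic above reduces to ordinary per-cluster means, and cstar simply scales each segment about its centroid (so cstar=2 draws segments twice as long as the centroid-to-point distance). A minimal sketch of that reading, on toy data invented for illustration; pts and seg are hypothetical names:

```r
# Toy points and cluster labels, invented for this example
pts <- data.frame(x = c(0, 2, 4, 10, 12), y = c(0, 2, 1, 5, 7),
                  center = factor(c(1, 1, 1, 2, 2)))

cstar <- 1  # 1 ends each segment exactly on its point; 2 doubles the length

# Per-cluster centroids, repeated for every point in the cluster
cx <- ave(pts$x, pts$center)
cy <- ave(pts$y, pts$center)

# One row per segment, in the layout geom_segment() expects
seg <- data.frame(x = cx, y = cy,
                  xend = cx + cstar * (pts$x - cx),
                  yend = cy + cstar * (pts$y - cy),
                  center = pts$center)
seg
```

The resulting seg has the same columns as dat above, so it can be passed to the same ggplot call.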
Related
Let's say I have a data.frame of three columns:
x <- seq(1,10)
y <- 0.1*x^2
z <- y+rnorm(10,0,10)
d <- data.frame(x,y,z)
I now want a ggplot that plots the points (x,z) and somewhat smooth lines going through (x,y).
How can I achieve that?
"%>%" <- magrittr::"%>%"
d %>%
ggplot2::ggplot(ggplot2::aes(x=x)) +
ggplot2::geom_point(ggplot2::aes(y=z)) +
ggplot2::geom_smooth(ggplot2::aes(y=y))
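A variant without the pipe, as a sketch. The set.seed() call is added for reproducibility, and method = "loess" with se = FALSE are assumed smoother settings, not something the question specifies.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the noisy z values are reproducible
x <- seq(1, 10)
y <- 0.1 * x^2
z <- y + rnorm(10, 0, 10)
d <- data.frame(x, y, z)

p <- ggplot(d, aes(x = x)) +
  geom_point(aes(y = z)) +                  # raw points (x, z)
  geom_smooth(aes(y = y), method = "loess", # smooth curve through (x, y)
              formula = y ~ x, se = FALSE)
print(p)
```

Since y is already a smooth function of x here, geom_line(aes(y = y)) would probably look almost the same; geom_smooth() matters more when y itself is noisy.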
Taking some generic data
A <- c(1997,2000,2000,1998,2000,1997,1997,1997)
B <- c(0,0,1,0,0,1,0,0)
df <- data.frame(A,B)
counts <- t(table(A,B))
frac <- counts[1,]/(counts[2,]+counts[1,])
C <- c(1998,2001,2000,1995,2000,1996,1998,1999)
D <- c(1,0,1,0,0,1,0,1)
df2 <- data.frame(C,D)
counts2 <- t(table(C,D))
frac2 <- counts2[1,]/(counts2[2,]+counts2[1,])
If we then want to create a scatterplot for the two datasets on the one scale
We can:
plot(frac, pch=22)
points(frac2, pch=19)
But we see we have two problems
first we want to put our year values (which appear as df$A and df$C) along the x axis
We want the x axis to automatically adjust the scale when the second data is added.
A solution using ggplot2 or base R would be desired
ggplot will do the scaling for you. You can convert the fracs to data.frames and use them with ggplot:
library(ggplot2)
ggplot(data.frame(y=frac, x=names(frac)), aes(x, y)) +
geom_point(col="salmon") +
geom_point(data=data.frame(y=frac2, x=names(frac2)), aes(x, y), col="steelblue") +
theme_bw()
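One caveat with the code above: names(frac) is a character vector, so the years are treated as discrete categories and spaced evenly. If a true numeric year axis is wanted, something along these lines may help. It is a sketch only; to_df() and the set column are hypothetical names introduced for illustration.

```r
library(ggplot2)

A <- c(1997, 2000, 2000, 1998, 2000, 1997, 1997, 1997)
B <- c(0, 0, 1, 0, 0, 1, 0, 0)
C <- c(1998, 2001, 2000, 1995, 2000, 1996, 1998, 1999)
D <- c(1, 0, 1, 0, 0, 1, 0, 1)

# Fraction of zeros per year (same quantity as frac in the question),
# returned as a data frame with a numeric year column
to_df <- function(year, flag, label) {
  frac <- tapply(flag == 0, year, mean)
  data.frame(year = as.numeric(names(frac)), frac = as.vector(frac), set = label)
}

both <- rbind(to_df(A, B, "first"), to_df(C, D, "second"))

# One data frame, one layer: the x scale covers both sets automatically
p <- ggplot(both, aes(year, frac, colour = set)) +
  geom_point(size = 3) +
  theme_bw()
print(p)
```

Stacking both sets into one data frame with a label column also gives a legend for free, instead of hard-coding a colour per layer.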
I'm doing a cluster analysis on a large spatial dataset where x and y are the spatial coordinates. I'm using hclust and the dynamicTreeCut, and then for the representation I'm showing a scatterplot using ggplot.
But I'm unable to visualize the clusters correctly, because the cut only ever seems to run along rows or along columns.
This is a toy sample of what I want, but with my data, the cluster works only in horizontal (so I think it should cluster or visualize only for y).
library(dynamicTreeCut) # provides cutreeDynamic()
library(ggplot2)
# x and y are coordinates of points in a map
x <- sample(1:1000, 80)
y <- sample(1:1000, 80)
df <- data.frame(x,y)
hc <- hclust(dist(df, "euclidean"),"complete")
maxCoreScatter <- 0.81
minGap <- (1 - maxCoreScatter) * 3/4
groups <- cutreeDynamic(hc, minClusterSize=1, method="hybrid", distM=as.matrix(dist(df)), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap)
df$group <- groups
p <- ggplot(df, aes(x,y,colour=factor(groups))) + geom_point() + theme(aspect.ratio = 1)
p
I have the following script that emulates the type of data structure I have and the analysis that I want to do on it:
library(ggplot2)
library(reshape2)
n <- 10
df <- data.frame(t=seq(n)*0.1, a =sort(rnorm(n)), b =sort(rnorm(n)),
a.1=sort(rnorm(n)), b.1=sort(rnorm(n)),
a.2=sort(rnorm(n)), b.2=sort(rnorm(n)))
head(df)
mdf <- melt(df, id=c('t'))
## head(mdf)
levels(mdf$variable) <- rep(c('a','b'),3)
g <- ggplot(mdf,aes(t,value,group=variable,colour=variable))
g +
stat_smooth(method='lm', formula = y ~ ns(x,3)) +
geom_point() +
facet_wrap(~variable)
What I would like to do in addition to this is plot the first derivative of the smoothing function against t and against the factors, c('a','b'), as well. Any suggestions how to go about this would be greatly appreciated.
You'll have to construct the derivative yourself, and there are two possible ways for that. Let me illustrate by using only one group :
require(splines) # thanks to @Chase for the notice
lmdf <- mdf[mdf$variable=="b",]
model <- lm(value~ns(t,3),data=lmdf)
You then simply define your derivative as diff(Y)/diff(X) based on your predicted values, as you would do for differentiation of a discrete function. It's a very good approximation if you take enough X points.
X <- data.frame(t=seq(0.1,1.0,length=100) ) # make an ordered sequence
Y <- predict(model,newdata=X) # calculate predictions for that sequence
plot(X$t,Y,type="l",main="Original fit") #check
dY <- diff(Y)/diff(X$t) # the derivative of your function
dX <- rowMeans(embed(X$t,2)) # centers the X values for plotting
plot(dX,dY,type="l",main="Derivative") #check
As you can see, this way you obtain the points for plotting the derivative. From here you can apply the same steps to both levels and combine the points into the plot you like. Below are the plots from this sample code:
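Applying the same finite-difference idea to both levels could look roughly like this. It is a sketch under assumptions: the data are a stand-in for mdf built with base R (so reshape2 is not needed), the seed is fixed, and deriv_for() is a hypothetical helper name.

```r
library(splines)

set.seed(1)  # assumption: fixed seed; the question's data are random
n <- 10
# Long-format stand-in for mdf: three replicate series for each of 'a' and 'b'
mdf <- data.frame(
  t        = rep(seq(n) * 0.1, times = 6),
  variable = factor(rep(c("a", "b"), each = n, times = 3)),
  value    = unlist(lapply(1:6, function(i) sort(rnorm(n))))
)

# Finite-difference derivative of the fitted spline for one level
deriv_for <- function(level, grid = seq(0.1, 1.0, length.out = 100)) {
  fit <- lm(value ~ ns(t, 3), data = mdf[mdf$variable == level, ])
  Y   <- predict(fit, newdata = data.frame(t = grid))
  data.frame(variable = level,
             t  = (head(grid, -1) + tail(grid, -1)) / 2,  # interval midpoints
             dY = unname(diff(Y) / diff(grid)))
}

# One long data frame with the derivative for every level
derivs <- do.call(rbind, lapply(levels(mdf$variable), deriv_for))
head(derivs)
```

Because derivs is in long format with a variable column, it can go straight into ggplot with facet_wrap(~variable).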
Here's one approach to plotting this with ggplot. There may be a more efficient way to do it, but this uses the manual calculations done by @Joris. We'll simply construct a long data.frame with all of the X and Y values while also supplying a variable to "facet" the plots:
require(ggplot2)
originalData <- data.frame(X = X$t, Y, type = "Original")
derivativeData <- data.frame(X = dX, Y = dY, type = "Derivative")
plotData <- rbind(originalData, derivativeData)
ggplot(plotData, aes(X,Y)) +
geom_line() +
facet_wrap(~type, scales = "free_y")
If the data are smoothed using smooth.spline, the derivative of the predicted data can be requested with the deriv argument of predict. Following on from @Joris's solution:
lmdf <- mdf[mdf$variable == "b",]
model <- smooth.spline(x = lmdf$t, y = lmdf$value)
Y <- predict(model, x = seq(0.1,1.0,length=100), deriv = 1) # first derivative
plot(Y$x, Y$y, type = 'l') # with a vector x, predict() returns plain vectors
Any dissimilarity in the output is most likely due to differences in the smoothing.
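To get a feel for what deriv = 1 returns, it can help to try it on a function whose derivative is known. The example below is a sketch on simulated data (noisy samples of sin(x), chosen purely for illustration, with an assumed seed and noise level), compared against cos(x):

```r
set.seed(1)  # assumption: fixed seed for the simulated noise
x <- seq(0, pi, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.05)   # noisy samples of a known function

fit  <- smooth.spline(x = x, y = y)
grid <- seq(0, pi, length.out = 100)
d1   <- predict(fit, x = grid, deriv = 1)   # list with numeric vectors $x and $y

str(d1)
plot(d1$x, d1$y, type = "l", main = "Estimated first derivative")
lines(grid, cos(grid), lty = 2)             # true derivative, for comparison
```

How closely the two curves agree depends on the amount of smoothing that smooth.spline() selects, which is the same caveat as above.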
Let's say I've got this data frame with 2 levels, LC and HC.
Now I want to get 2 plots like the one below, drawn on top of each other.
data <- data.frame(
welltype=c("LC","LC","LC","LC","LC","HC","HC","HC","HC","HC"),
value=c(1,2,1,2,1,5,4,5,4,5))
The code to get the following plot is:
x <- rnorm(1000)
y <- hist(x)
plot(y$breaks,
c(y$counts,0),
type="s",col="blue")
(with thanks to Joris Meys)
So, how do I even start on this? Since I'm used to Java I was thinking of a for loop, but I've been told not to do it this way.
Besides the method provided by Aaron, there's a ggplot solution as well (see below),
but I would strongly advise you to use densities, as they give nicer plots and are a whole lot easier to construct:
# make data
wells <- c("LC","HC","BC")
Data <- data.frame(
welltype=rep(wells,each=100),
value=c(rnorm(100),rnorm(100,2),rnorm(100,3))
)
ggplot(Data,aes(value,fill=welltype)) + geom_density(alpha=0.2)
gives :
For the plot you requested :
# make hists dataframe
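hists <- tapply(Data$value,Data$welltype,
function(i){
tmp <- hist(i, plot=FALSE) # compute the bins without drawing anything
data.frame(br=tmp$breaks,co=c(tmp$counts,0))
})
ll <- sapply(hists,nrow)
hists <- do.call(rbind,hists)
# tapply orders the groups alphabetically, so take the labels from names(ll), not wells
hists$fac <- rep(names(ll),ll)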
# make plot
require(ggplot2)
qplot(br,co,data=hists,geom="step",colour=fac)
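As a possible shortcut, ggplot can do the binning itself, so the intermediate hists data frame is not strictly needed. This is a sketch, not a drop-in replacement: bins = 20 is an arbitrary choice, the seed is assumed, and the bins are shared across groups, so the steps will not match the per-group breaks chosen by hist().

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the example is reproducible
wells <- c("LC", "HC", "BC")
Data <- data.frame(
  welltype = rep(wells, each = 100),
  value    = c(rnorm(100), rnorm(100, 2), rnorm(100, 3))
)

# stat = "bin" counts values per bin; geom_step draws the counts as a staircase
p <- ggplot(Data, aes(value, colour = welltype)) +
  geom_step(stat = "bin", bins = 20)
print(p)
```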
You can use the same code except with points instead of plot for adding additional lines to the plot.
Making up some data
set.seed(5)
d <- data.frame(x=c(rnorm(1000)+3, rnorm(1000)),
g=rep(1:2, each=1000) )
And doing it in a fairly straightforward way:
x1 <- d$x[d$g==1]
x2 <- d$x[d$g==2]
y1 <- hist(x1, plot=FALSE)
y2 <- hist(x2, plot=FALSE)
plot(y1$breaks, c(y1$counts,0), type="s",col="blue",
xlim=range(c(y1$breaks, y2$breaks)), ylim=range(c(0,y1$counts, y2$counts)))
points(y2$breaks, c(y2$counts,0), type="s", col="red")
Or in a more R-ish way:
col <- c("blue", "red")
ds <- split(d$x, d$g)
hs <- lapply(ds, hist, plot=FALSE)
plot(0,0,type="n",
ylim=range(c(0,unlist(lapply(hs, function(x) x$counts)))),
xlim=range(unlist(lapply(hs, function(x) x$breaks))) )
for(i in seq_along(hs)) {
points(hs[[i]]$breaks, c(hs[[i]]$counts,0), type="s", col=col[i])
}
EDIT: Inspired by Joris's answer, I'll note that lattice can also easily do overlapping density plots.
library(lattice)
densityplot(~x, group=g, data=d)