When overlaying ggplot density plots of data that have the same length but different scales, is it possible to normalise the x scale so that the densities line up? Alternatively, is there a way to normalise the density y scale?
library(ggplot2)
data <- data.frame(x = c('A','B','C','D','E'), y1 = rnorm(100, mean = 0, sd = 1),
y2 = rnorm(100, mean = 0, sd = 50))
p <- ggplot(data)
# Overlaying the density plots is a fail
p + geom_density(aes(x=y1), fill=NA) + geom_density(aes(x=y2), alpha=0.3,col=NA,fill='red')
# You can compress the xscale in the aes() argument:
y1max <- max(data$y1)
y2max <- max(data$y2)
p + geom_density(aes(x=y1), fill=NA) + geom_density(aes(x=y2*y1max/y2max), alpha=0.3,col=NA,fill='red')
# But it doesn't fix the density scale. Any solution?
# And will it work with facet_wrap?
p + geom_density(aes(x=y1), col=NA,fill='grey30') + facet_wrap(~ x, ncol=2)
Thanks!
Does this do what you were hoping for?
p + geom_density(aes(x=scale(y1)), fill=NA) +
geom_density(aes(x=scale(y2)), alpha=0.3,col=NA,fill='red')
Called with only a data argument, the scale function centers an empirical distribution on 0 and then divides the resulting values by the sample standard deviation, so the result has a standard deviation of 1. You can change the defaults for the location (center) and the degree of "compression" or "expansion" (scale). You will probably need to look into setting appropriate x scales for y1 and y2, which may take some preprocessing with scale. The scaling factor is recorded in an attribute of the returned object.
attr(scale(data$y2), "scaled:scale")
#[1] 53.21863
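To make the attribute bookkeeping concrete, here is a minimal base-R sketch (the variable names and the seed are made up for illustration). It standardises a vector with scale(), recovers the original values from the stored attributes, and standardises within groups using ave(), which may be useful before facet_wrap if each panel should get its own centre and spread:

```r
set.seed(42)  # arbitrary seed, only for reproducibility
y2  <- rnorm(100, mean = 0, sd = 50)
grp <- rep(c("A", "B", "C", "D", "E"), length.out = 100)

# scale() returns a one-column matrix and stores the centre and
# scale it used as attributes
z2  <- scale(y2)
ctr <- attr(z2, "scaled:center")  # sample mean of y2
scl <- attr(z2, "scaled:scale")   # sample standard deviation of y2

# Undo the standardisation, e.g. to label an axis in the original units
y2_back <- as.numeric(z2) * scl + ctr

# Standardise separately within each group (one centre/scale per group)
z2_by_group <- ave(y2, grp, FUN = function(v) as.numeric(scale(v)))
```

The back-transformed values should match y2 up to floating-point error, and each group in z2_by_group should have mean 0 and standard deviation 1.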
Related
Suppose I am trying to generate prediction intervals for two sets of scores, X and Y:
set.seed(1111)
n = 1000
x1 = rnorm(n)
x2 = .5*x1 + rnorm(n, 0, sqrt(1-.25))
x_mod = lm(x2~x1)
x_se = predict(x_mod, interval="prediction", level=.68, se.fit=TRUE)$se.fit
y1 = .4*x1 + rnorm(n, 0, sqrt(1-.16))
y2 = .7*y1 + rnorm(n, 0, sqrt(1-.49))
y_mod = lm(y2~y1)
y_se = predict(y_mod, interval="prediction", level=.68, se.fit=TRUE)$se.fit
Now what I want to do is plot the predicted values of X2 and Y2, but want to visually represent my uncertainty. One way to do this is with an ellipse, rather than a point. However, when I plot an ellipse, it generates one ellipse for the entire scatterplot, rather than an ellipse for each point:
d = data.frame(x1,x2,x2_pred = predict(x_mod), x_se,
y1,y2,y2_pred = predict(y_mod), y_se)
require(ggplot2)
ggplot(data=d, aes(x2_pred, y2_pred)) +
stat_ellipse(mapping=aes(x2_pred, y2_pred))
Does anyone know of a way to do a separate ellipse for each point?
Also, I'm open to other ideas for how to represent this uncertainty. (A point with a gradient of color, perhaps?)
The package ggforce provides a geom_ellipse:
library(ggforce)
ggplot(data=d, aes(x2_pred, y2_pred)) +
geom_ellipse(aes(x0 = x2_pred, y0 = y2_pred, a = x_se, b = y_se, angle = 0))
Another option is to draw error bars at each predicted value, with or without the points themselves:
ggplot(data=d, aes(x2_pred, y2_pred)) +
# geom_point(alpha=0.2) +
geom_errorbar(aes(ymin=y2_pred-y_se, ymax=y2_pred+y_se)) +
geom_errorbarh(aes(xmin=x2_pred-x_se, xmax=x2_pred+x_se))
This approach nicely shows that the error is smallest close to the means for both x and y, and grows in the appropriate direction farther away. You could play around with themes and alpha to get something that looks nicer. The second looks a little cleaner to me, but it depends on the message you're trying to send.
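One thing worth checking before choosing the bar or ellipse sizes: as far as I understand predict.lm, the se.fit component is the standard error of the fitted mean, whereas the interval requested with interval="prediction" comes back in the lwr and upr columns of the fit component. A rough sketch of the comparison, reusing the simulation for X from the question (se_mean and half_width are names I made up):

```r
set.seed(1111)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n, 0, sqrt(1 - 0.25))
x_mod <- lm(x2 ~ x1)

# predict.lm warns that predictions on the current data refer to future
# responses; that is expected here, so the warning is suppressed
pr <- suppressWarnings(
  predict(x_mod, interval = "prediction", level = 0.68, se.fit = TRUE)
)

se_mean    <- pr$se.fit                               # SE of the fitted mean
half_width <- (pr$fit[, "upr"] - pr$fit[, "lwr"]) / 2 # half the 68% prediction interval

summary(se_mean)
summary(half_width)
```

If the goal is to show where a new observation is likely to fall, half_width may be the more appropriate size for the error bars; se_mean only reflects uncertainty about the regression line itself.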
I'm trying to plot a few Binomial distributions and show that as N increases, the curve looks more and more like the normal. I've tried using dbinom, but here's what I get:
Here's the code I'm using to produce this distribution:
x <- -5:250
y10 <- dbinom(x, 10, 0.5)
y30 <- dbinom(x, 30, 0.5)
y60 <- dbinom(x, 60, 0.5)
y100 <- dbinom(x, 100, 0.5)
ynorm <- dnorm(x, mean=-1, sd=1)
y10 <- y10 * sqrt(y10) / 0.8
y30 <- y30 * sqrt(y30) / 0.8
y60 <- y60 * sqrt(y60) / 0.8
y100 <- y100 * sqrt(y100) / 0.8
y10 <- y10[7:17]
y30 <- y30[17:27]
y60 <- y60[32:42]
y100 <- y100[52:62]
plot(range(0, 10), range(0, 0.5), type = "n")
lines(ynorm, col = "red", type = "l")
lines(y10, col = "blue", type = "l")
lines(y30, col = "orange", type = "l")
lines(y60, col = "green", type = "l")
lines(y100, col = "yellow", type = "l")
Does anyone know how to correctly adjust a binomial distribution in R?
Theoretically an N of 1000 should make it look like a normal distribution, but I have no clue how to get there, and I've tried/failed to use ggplot2 :(
You can rescale the x values so that x==0 always occurs at the peak density for each binomial density. You can do this by finding the x value at which the density is a maximum for each of the densities. For example:
library(ggplot2)
theme_set(theme_classic())
library(dplyr)
x <- -5:250
n = c(6,10,30,60,100)
p = 0.5
binom = data.frame(x=rep(x, length(n)),
y=dbinom(x, rep(n, each=length(x)), p),
n=rep(n, each=length(x)))
ggplot(binom %>% filter(y > 1e-5) %>%
group_by(n) %>%
mutate(x = x - x[which.max(y)]),
aes(x, y, colour=factor(n))) +
geom_line() + geom_point(size=0.6) +
labs(colour="n")
In reference to your comment, here's one way to add a normal density in addition to the binomial density: The mean of a binomial distribution is n*p, where n is the number of trials and p is the probability of success. The variance is n*p*(1-p). So, for each of the binomial densities above, we want normal densities with the same mean and variance. We create a data frame of these below and then plot the binomial and normal densities together.
First, create a new vector of x values that includes a higher density of points, to reflect the fact that the normal distribution is continuous, rather than discrete:
x = seq(-5,250,length.out=2000)
Now we create a data frame of normal densities with the same means and variances as the binomial densities above:
normal=data.frame(x=rep(x, length(n)),
y=dnorm(x, rep(n,each=length(x))*p, (rep(n, each=length(x))*p*(1-p))^0.5),
n=rep(n, each=length(x)))
# Cut off y-values below ymin
ymin = 1e-3
So now we have two data frames to plot. We still add the binom data frame in the main call to ggplot. But here we also add a call to geom_line for plotting the normal densities. And we give geom_line the normal data frame. Also, for this plot we've used geom_segment to emphasize the discrete points of the binomial density (you could also use geom_bar for this).
ggplot(binom %>% filter(y > ymin), aes(x, y)) +
geom_point(size=1.2, colour="blue") +
geom_line(data=normal %>% filter(y > ymin), lwd=0.7, colour="red") +
geom_segment(aes(x=x, xend=x, y=0, yend=y), lwd=0.8, alpha=0.5, colour="blue") +
facet_grid(. ~ n, scales="free", space="free")
Here's what the new plot looks like. You can change the scaling in various ways and there are probably many other ways to tweak it, depending on what you want to emphasize.
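If you want a number to go with the picture, the following base-R sketch compares each binomial probability mass function with the normal density that has the same mean n*p and variance n*p*(1-p), and records the largest absolute gap between them. The choice of n values and the name max_gap are mine, not part of the original question:

```r
p <- 0.5
n_values <- c(10, 30, 60, 100, 1000)

# For each n, evaluate both distributions at every possible count 0..n
max_gap <- sapply(n_values, function(n) {
  k <- 0:n
  binom_pmf  <- dbinom(k, size = n, prob = p)
  normal_pdf <- dnorm(k, mean = n * p, sd = sqrt(n * p * (1 - p)))
  max(abs(binom_pmf - normal_pdf))
})

data.frame(n = n_values, max_gap = max_gap)
```

The gap should shrink as n grows, which is the convergence the plots are meant to illustrate.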
The function below calculates binned averages, sizes the bin points on the graph relative to the number of observations in each bin, and plots a lowess line through the bin means. Instead of plotting the lowess line through the bin means, however, I would like to plot the line through the original dataset so that the error bands on the lowess line represent the uncertainty in the actual dataset, not the uncertainty in the binned averages. How do I modify geom_smooth() so that it will plot the line using df instead of dfplot?
library(fields)
library(ggplot2)
binplot <- function(df, yvar, xvar, sub = FALSE, N = 50, size = 40, xlabel = "X", ylabel = "Y"){
  if(sub != FALSE){
    df <- subset(df, eval(parse(text = sub)))
  }
  out <- stats.bin(df[,xvar], df[,yvar], N= N)
  x <- out$centers
  y <- out$stats[ c("mean"),]
  n <- out$stats[ c("N"),]
  dfplot <- as.data.frame(cbind(x,y,n))
  if(size != FALSE){
    sizes <- n * (size/max(n))
  }else{
    sizes = 3
  }
  ggplot(dfplot, aes(x,y)) +
    xlab(xlabel) +
    ylab(ylabel) +
    geom_point(shape=1, size = sizes) +
    geom_smooth()
}
Here is a reproducible example that demonstrates how the function currently works:
sampleSize <- 10000
x1 <- rnorm(n=sampleSize, mean = 0, sd = 4)
y1 <- x1 * 2 + x1^2 * .3 + rnorm(n=sampleSize, mean = 5, sd = 10)
binplot(data.frame(x1,y1), "y1", "x1", N = 25)
As you can see, the error band on the lowess line reflects the uncertainty you would have if each bin had an equal number of observations, but they do not. The bins at the extremes have far fewer observations (as illustrated by the size of the points), and the lowess line's error band should reflect that.
You can explicitly set the data= parameter for each layer. You will also need to change the aesthetic mapping since the original data.frame had different column names. Just change your geom_smooth call to
geom_smooth(data=df, aes_string(xvar, yvar))
With the sample data, this returned
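Newer ggplot2 releases deprecate aes_string(), so here is a sketch of the same idea using the .data pronoun inside aes(). It is not the original binplot(): binplot_raw is a hypothetical, simplified variant that bins with cut() rather than fields::stats.bin, uses the bin mean of x instead of the bin centre, and fixes method = "loess" so it runs without extra packages on a smaller sample:

```r
library(ggplot2)

# Hypothetical simplified variant of binplot(); the binning is done with
# cut() and tapply() rather than fields::stats.bin
binplot_raw <- function(df, yvar, xvar, n_bins = 25) {
  bin <- cut(df[[xvar]], breaks = n_bins)
  dfplot <- data.frame(
    x = as.vector(tapply(df[[xvar]], bin, mean)),
    y = as.vector(tapply(df[[yvar]], bin, mean)),
    n = as.vector(table(bin))
  )
  dfplot <- dfplot[dfplot$n > 0, ]  # drop empty bins

  ggplot(dfplot, aes(x, y)) +
    geom_point(aes(size = n), shape = 1) +
    # The smoother is fitted to the raw data, not to the bin means
    geom_smooth(data = df,
                aes(x = .data[[xvar]], y = .data[[yvar]]),
                method = "loess", formula = y ~ x)
}

set.seed(1)  # arbitrary seed
x1 <- rnorm(500, mean = 0, sd = 4)
y1 <- x1 * 2 + x1^2 * 0.3 + rnorm(500, mean = 5, sd = 10)
p_raw <- binplot_raw(data.frame(x1, y1), "y1", "x1")
```

Printing p_raw draws the plot. Because the smoothing layer gets its own data and its own x and y mappings, its confidence band is computed from all the raw observations.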
I have two datasets with two continuous variables: duration and waiting.
library("MASS")
data(geyser)
geyser1 <- geyser[1:150,]
geyser2 <- geyser[151:299,]
geyser2$duration <- geyser2$duration - 1
geyser2$waiting <- geyser2$waiting - 20
For each dataset I output a 2D density plot
ggplot(geyser1, aes(x = duration, y = waiting)) +
xlim(0.5, 6) + ylim(40, 110) +
stat_density2d(aes(alpha=..level..),
geom="polygon", bins = 10)
ggplot(geyser2, aes(x = duration, y = waiting)) +
xlim(0.5, 6) + ylim(40, 110) +
stat_density2d(aes(alpha=..level..),
geom="polygon", bins = 10)
I now want to produce a plot which indicates the regions where the two plots have the same density (white), negative differences (gradation from white to blue where geyser2 is denser than geyser1) and positive differences (gradation from white to red where geyser1 is denser than geyser2).
How to compute and plot the difference of the densities?
You can do this by first using kde2d to calculate the densities and then subtracting them from each other. Then you do some data reshaping to get it into a form that can be fed to ggplot2.
library(ggplot2)
library(reshape2) # For melt function
library(scales) # For muted function
# Calculate the common x and y range for geyser1 and geyser2
xrng = range(c(geyser1$duration, geyser2$duration))
yrng = range(c(geyser1$waiting, geyser2$waiting))
# Calculate the 2d density estimate over the common range
d1 = kde2d(geyser1$duration, geyser1$waiting, lims=c(xrng, yrng), n=200)
d2 = kde2d(geyser2$duration, geyser2$waiting, lims=c(xrng, yrng), n=200)
# Confirm that the grid points for each density estimate are identical
identical(d1$x, d2$x) # TRUE
identical(d1$y, d2$y) # TRUE
# Calculate the difference between the 2d density estimates
diff12 = d1
diff12$z = d2$z - d1$z
## Melt data into long format
# First, add row and column names (x and y grid values) to the z-value matrix
rownames(diff12$z) = diff12$x
colnames(diff12$z) = diff12$y
# Now melt it to long format
diff12.m = melt(diff12$z)
names(diff12.m) = c("Duration","Waiting","z")
# Plot difference between geyser2 and geyser1 density
ggplot(diff12.m, aes(Duration, Waiting, z=z, fill=z)) +
geom_tile() +
stat_contour(aes(colour=..level..), binwidth=0.001) +
scale_fill_gradient2(low="red",mid="white", high="blue", midpoint=0) +
scale_colour_gradient2(low=muted("red"), mid="white", high=muted("blue"), midpoint=0) +
coord_cartesian(xlim=xrng, ylim=yrng) +
guides(colour="none")
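If you would rather not depend on reshape2, the long-format data frame can also be built with expand.grid from base R. This is a swapped-in alternative to the melt step, not the method used above. The sketch below uses a coarser grid (n = 50), skips the shift applied to geyser2 in the question, and diff_long is a name I made up:

```r
library(MASS)  # kde2d and the geyser data

data(geyser)
geyser1 <- geyser[1:150, ]
geyser2 <- geyser[151:299, ]

# Common range for both halves of the data
xrng <- range(geyser$duration)
yrng <- range(geyser$waiting)

# A coarser grid than in the answer, just to keep the sketch quick
d1 <- kde2d(geyser1$duration, geyser1$waiting, lims = c(xrng, yrng), n = 50)
d2 <- kde2d(geyser2$duration, geyser2$waiting, lims = c(xrng, yrng), n = 50)

# expand.grid varies its first argument fastest, which matches the
# column-major order in which as.vector() unrolls the z matrix
# (rows of z correspond to x, columns to y)
diff_long <- expand.grid(Duration = d1$x, Waiting = d1$y)
diff_long$z <- as.vector(d2$z - d1$z)
```

diff_long has the same three columns as diff12.m, so it should be usable in the ggplot call in its place.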
I'm trying to plot a line, smoothed by loess, but I'm trying to figure out how to include shaded error areas defined by existing variables, but also smoothed.
This code creates example data:
set.seed(12345)
data <- cbind(rep("A", 100), rnorm(100, 0, 1))
data <- rbind(data, cbind(rep("B", 100), rnorm(100, 5, 1)))
data <- rbind(data, cbind(rep("C", 100), rnorm(100, 10, 1)))
data <- rbind(data, cbind(rep("D", 100), rnorm(100, 15, 1)))
data <- cbind(rep(1:100, 4), data)
data <- data.frame(data)
names(data) <- c("num", "category", "value")
data$num <- as.numeric(data$num)
data$value <- as.numeric(data$value)
data$upper <- data$value+0.20
data$lower <- data$value-0.30
Plotting the data below, this is what I get:
ggplot(data, aes(x=num, y=value, colour=category)) +
stat_smooth(method="loess", se=F)
What I'd like is a plot that looks like the following, except with the upper and lower bounds of the shaded areas being bounded by smoothed lines of the "upper" and "lower" variables in the generated data.
Any help would be greatly appreciated.
Here's one way to add smoothed versions of upper and lower. We'll add LOESS predictions for upper and lower to the data frame and then plot those using geom_ribbon. It would be more elegant if this could all be done within the call to ggplot. That's probably possible by feeding a special-purpose function to stat_summary, and hopefully someone else will post an answer using that approach.
# Expand the scale of the upper and lower values so that the difference
# is visible in the plot
data$upper = data$value + 10
data$lower = data$value - 10
# Order data by category and num
data = data[order(data$category, data$num),]
# Create LOESS predictions for the values of upper and lower
# and add them to the data frame. I'm sure there's a better way to do this,
# but my attempts with dplyr and tapply both failed, so I've resorted to the clunky
# method below.
data$upperLoess = unlist(lapply(LETTERS[1:4],
function(x) predict(loess(data$upper[data$category==x] ~
data$num[data$category==x]))))
data$lowerLoess = unlist(lapply(LETTERS[1:4],
function(x) predict(loess(data$lower[data$category==x] ~
data$num[data$category==x]))))
# Use geom_ribbon to add a prediction band bounded by the LOESS predictions for
# upper and lower
ggplot(data, aes(num, value, colour=category, fill=category)) +
geom_smooth(method="loess", se=FALSE) +
geom_ribbon(aes(x=num, y=value, ymax=upperLoess, ymin=lowerLoess),
alpha=0.2)
And here's the result:
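For the per-category LOESS step, one possibly tidier option is to let ave() handle the splitting and reassembly. The sketch below rebuilds data similar to the example as a data frame from the start (so the random draws differ from the question's), and smooth_by_category is a hypothetical helper, not an existing function:

```r
set.seed(12345)
# Build the example data directly as a data frame so columns stay numeric
data <- data.frame(
  num      = rep(1:100, times = 4),
  category = rep(c("A", "B", "C", "D"), each = 100),
  value    = rnorm(400, mean = rep(c(0, 5, 10, 15), each = 100), sd = 1)
)
data$upper <- data$value + 10
data$lower <- data$value - 10

# Hypothetical helper: LOESS-smooth one column against num within each
# category. ave() passes the row indices of each category to FUN and puts
# the fitted values back in the original row positions.
smooth_by_category <- function(df, column) {
  ave(seq_len(nrow(df)), df$category, FUN = function(rows) {
    fit <- loess(reformulate("num", response = column), data = df[rows, ])
    predict(fit)
  })
}

data$upperLoess <- smooth_by_category(data, "upper")
data$lowerLoess <- smooth_by_category(data, "lower")
```

Because ave() restores the original row order, this version does not rely on the data being sorted by category first. The resulting columns can be used in the geom_ribbon call as before.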