Custom ggplot2 shaded error areas on categorical line plot - r

I'm trying to plot a line, smoothed by loess, but I'm trying to figure out how to include shaded error areas defined by existing variables, but also smoothed.
This code creates example data:
set.seed(12345)
data <- cbind(rep("A", 100), rnorm(100, 0, 1))
data <- rbind(data, cbind(rep("B", 100), rnorm(100, 5, 1)))
data <- rbind(data, cbind(rep("C", 100), rnorm(100, 10, 1)))
data <- rbind(data, cbind(rep("D", 100), rnorm(100, 15, 1)))
data <- cbind(rep(1:100, 4), data)
data <- data.frame(data)
names(data) <- c("num", "category", "value")
data$num <- as.numeric(data$num)
data$value <- as.numeric(data$value)
data$upper <- data$value+0.20
data$lower <- data$value-0.30
Plotting the data below, this is what I get:
ggplot(data, aes(x=num, y=value, colour=category)) +
stat_smooth(method="loess", se=F)
What I'd like is a plot that looks like the following, except with the upper and lower bounds of the shaded areas being bounded by smoothed lines of the "upper" and "lower" variables in the generated data.
Any help would be greatly appreciated.

Here's one way to add smoothed versions of upper and lower. We'll add LOESS predictions for upper and lower to the data frame and then plot those using geom_ribbon. It would be more elegant if this could all be done within the call to ggplot. That's probably possible by feeding a special-purpose function to stat_summary, and hopefully someone else will post an answer using that approach.
# Expand the scale of the upper and lower values so that the difference
# is visible in the plot
data$upper = data$value + 10
data$lower = data$value - 10
# Order data by category and num
data = data[order(data$category, data$num),]
# Create LOESS predictions for the values of upper and lower
# and add them to the data frame. I'm sure there's a better way to do this,
# but my attempts with dplyr and tapply both failed, so I've resorted to the clunky
# method below.
data$upperLoess = unlist(lapply(LETTERS[1:4],
function(x) predict(loess(data$upper[data$category==x] ~
data$num[data$category==x]))))
data$lowerLoess = unlist(lapply(LETTERS[1:4],
function(x) predict(loess(data$lower[data$category==x] ~
data$num[data$category==x]))))
# Use geom_ribbon to add a prediction band bounded by the LOESS predictions for
# upper and lower
ggplot(data, aes(num, value, colour=category, fill=category)) +
geom_smooth(method="loess", se=FALSE) +
geom_ribbon(aes(x=num, y=value, ymax=upperLoess, ymin=lowerLoess),
alpha=0.2)
And here's the result:

Related

How do I add the curve from GauPro in ggplot?

GauPro is an R library for fitting gaussian processes. You can also get it to produce a nuce predicted curve for you.
The documentation for GauPro uses builtin r plotting functions to do plots like this:
gp <- GauPro(x,y) ## fit a gaussian process model to x & y
plot(x,y) ## plots the x,y points
curve(gp$predict(x), add=T, col=2) ## adds the predicted curve from the gaussian process
What would be the equivalent using ggplot? I can get the points to show up, but I can't quite figure out how to add the curve.
GauPro documentation I refer to is here
We can do this by building a little data frame of predictions. Let's start by loading the necessary packages and creating some sample data:
library(GauPro)
library(ggplot2)
set.seed(69)
x <- 1:10
y <- cumsum(runif(10))
Now we can create our model and plot it using the same plotting functions shown in the vignette you linked:
gp <- GauPro(x, y)
plot(x, y)
curve(gp$predict(x), add = TRUE, col = 2)
Now if we want to customize this plot using ggplot, we need a data frame with columns for the x values at which we wish to predict, the y prediction at that point, and a column each for upper and lower 95% confidence intervals. We can obtain the x values like this:
new_x <- seq(min(x), max(x), length.out = 100)
and we can get the three sets of corresponding y values using predict like this:
predict_df <- predict(gp, new_x, se.fit = TRUE)
predict_df$x <- new_x
predict_df$y <- predict_df$mean
predict_df$lower <- predict_df$y - 1.96 * predict_df$se
predict_df$upper <- predict_df$y + 1.96 * predict_df$se
this is now quite straightforward to plot in ggplot with themes customized as you choose:
ggplot(data.frame(x, y), aes(x, y)) +
geom_point() +
geom_line(data = predict_df, color = "deepskyblue4", linetype = 2) +
geom_ribbon(data = predict_df, aes(ymin = lower, ymax = upper),
alpha = 0.2, fill = "deepskyblue4") +
theme_minimal()
Created on 2020-07-29 by the reprex package (v0.3.0)

How to delete outliers from a QQ-plot graph made with ggplot()?

I have a two dimensional dataset (say columns x and y). I use the following function to plot a QQ-plot of this data.
# Creating a toy data for presentation
df = cbind(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
# Plotting the QQ-plot
df_qq = as.data.frame(qqplot(df[,1], df[,2], plot.it=FALSE))
ggplot(df_qq) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
That is the resulting graph:
My question is, how to avoid plotting the last point (i.e. (12,32))? I would rather not delete it manually because i have several of these data pairs and there are similar outliers in each of them. What I would like to do is to write a code that somehow identifies the points that are too far from the 45 degree line and eliminate them from df_qq (for instance if it is 5 times further than the average distance to the 45 line it can be eliminated). My main objective is to make the graph easier to read. When outliers are not eliminated the more regular part of the QQ-plot occupies a too small part of the graph and it prevents me from visually evaluating the similarity of two vectors apart from the outliers.
I would appreciate any help.
There is a CRAN package, referenceIntervals that uses Cook's distance to detect outliers. By applying it to the values of df_qq$y it can then give an index into df_qq to be removed.
library(referenceIntervals)
out <- cook.outliers(df_qq$y)$outliers
i <- which(df_qq$y %in% out)
ggplot(df_qq[-i, ]) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
Edit.
Following the OP's comment,
But as far as I understand this function does not look at
the relation between x & y,
maybe the following function is what is needed to remove outliers only if they are outliers in one of the vectors but not in both.
cookOut <- function(X){
out1 <- cook.outliers(X[[1]])$outliers
out2 <- cook.outliers(X[[2]])$outliers
i <- X[[1]] %in% out1
j <- X[[2]] %in% out2
w <- which((!i & j) | (i & !j))
if(length(w)) X[-w, ] else X
}
Test with the second data set, the one in the comment.
The extra vector, id is just to make faceting easier.
df1 <- data.frame(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
df2 <- data.frame(x = c(1,5,8,2,9,6,1,7,32), y = c(1,4,10,1,6,5,2,1,32))
df_qq1 = as.data.frame(qqplot(df1[,1], df1[,2], plot.it=FALSE))
df_qq2 = as.data.frame(qqplot(df2[,1], df2[,2], plot.it=FALSE))
df_qq_out1 <- cookOut(df_qq1)
df_qq_out2 <- cookOut(df_qq2)
df_qq_out1$id <- "A"
df_qq_out2$id <- "B"
df_qq_out <- rbind(df_qq_out1, df_qq_out2)
ggplot(df_qq_out) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1) +
facet_wrap(~ id)

Making surface plot of regression estimates from multiple continuous variables

I have a multi-level model with categorical and continuous variables and splines. Nice and complex. Anyhow I am trying to visualize model fit.
For example, here is some toy data:
library(lme4)
library(rms)
library(gridExtra)
## Make model using sleepstudy data
head(sleepstudy)
# Add some extra vars
sleepstudy$group <- factor( sample(c(1,2), nrow(sleepstudy), replace=TRUE) )
sleepstudy$x1 <- jitter(sleepstudy$Days, factor=5)^2 * jitter(sleepstudy$Reaction)
# Set up a mixed model with spline
fm1 <- lmer(Reaction ~ rcs(Days, 4) * group + (rcs(Days, 4) | Subject), sleepstudy)
# Now add continuous covar
fm2 <- lmer(Reaction ~ rcs(Days, 4) * group + x1 + (rcs(Days, 4) | Subject), sleepstudy)
# Plot fit
new.df <- sleepstudy
new.df$pred1 <- predict(fm1, new.df, allow.new.levels=TRUE, re.form=NA)
new.df$pred2 <- predict(fm2, new.df, allow.new.levels=TRUE, re.form=NA)
g1 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred1, col=group), size=2) +
ggtitle("Model 1")
g2 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred2, col=group), size=2) +
ggtitle("Model 2")
grid.arrange(g1, g2, nrow=1)
Plot 1 is smooth, but plot 2 is jagged due to the effect of x1. So I would like to make a surface plot with x = Days, y = x1 and z = pred2 and stratified by group. Not having experience of surface plots I've started out with the wireframe command:
wireframe(pred2 ~ Days * x1, data = new.df[new.df$group==1,],
xlab = "Days", ylab = "x1", zlab="Predicted fit"
)
However although this command does not give an error, my plot is blank:
Questions:
Where am I going wrong with my wireframe?
Is there a better way to visualize my model fit?
I figured out that the data format needed for a wireframe' orplot_ly' surface is that of a 2D matrix of x rows by y columns of corresponding z values (I got a hint towards this from this question Plotly 3d surface graph has incorrect x and y axis values). I also realised I could use `expand.grid' to make a matrix covering the range of possible x and y values and use those to predict z as follows:
days <- 0:9
x1_range <- range(sleepstudy$x1)[2] * c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3)
new.data2 <- expand.grid(Days = days, x1 = x1_range, group = unique(sleepstudy$group) )
new.data2$pred <- predict(fm2, new.data2, allow.new.levels=TRUE, re.form=NA)
I can then stuff those into two different matrices to represent the z-surface for each group in my model:
surf1 <- ( matrix(new.data2[new.data2$group == 1, ]$pred, nrow = length(days), ncol = length(x1_range)) )
surf2 <- ( matrix(new.data2[new.data2$group == 2, ]$pred, nrow = length(days), ncol = length(x1_range)) )
group <- c(rep(1, nrow(surf1)), rep(2, nrow(surf2) ))
Finally I can use plot_ly to plot each surface:
plot_ly (z=surf1, x = mets_range, y = ages, type="surface") %>%
add_surface (z = surf2, surfacecolor=surf2,
color=c('red','yellow'))
The resulting plot:
So the resulting plot is what I wanted (albeit not very useful in this made up example but useful in real data). The only thing I can't figure out is how to show two different color scales. I can suppres the scale altogether but if anyone knows how to show 2 scales for different surfaces do please let me know and I will edit the answer.

ggplot loess line from one dataset over scatterplot of another

The function below calculates binned averages, sizes the bin points on the graph relative to the number of observations in each bin, and plots a lowess line through the bin means. Instead of plotting the lowess line through the bin means, however, I would like to plot the line through the original dataset so that the error bands on the lowess line represent the uncertainty in the actual dataset, not the uncertainty in the binned averages. How do I modify geom_smooth() so that it will plot the line using df instead of dfplot?
library(fields)
library(ggplot2)
binplot <- function(df, yvar, xvar, sub = FALSE, N = 50, size = 40, xlabel = "X", ylabel = "Y"){
if(sub != FALSE){
df <- subset(df, eval(parse(text = sub)))
}
out <- stats.bin(df[,xvar], df[,yvar], N= N)
x <- out$centers
y <- out$stats[ c("mean"),]
n <- out$stats[ c("N"),]
dfplot <- as.data.frame(cbind(x,y,n))
if(size != FALSE){
sizes <- n * (size/max(n))
}else{
sizes = 3
}
ggplot(dfplot, aes(x,y)) +
xlab(xlabel) +
ylab(ylabel) +
geom_point(shape=1, size = sizes) +
geom_smooth()
}
Here is a reproducible example that demonstrates how the function currently works:
sampleSize <- 10000
x1 <- rnorm(n=sampleSize, mean = 0, sd = 4)
y1 <- x1 * 2 + x1^2 * .3 + rnorm(n=sampleSize, mean = 5, sd = 10)
binplot(data.frame(x1,y1), "y1", "x1", N = 25)
As you can see, the error band on the lowess line reflects the uncertainty if each bin had an equal number of observations, but they do not. The bins at the extremes have far fewer obseverations (as illustrated by the size of the points) and the lowess line's error band should reflect that.
You can explicitly set the data= parameter for each layer. You will also need to change the aesthetic mapping since the original data.frame had different column names. Just change your geom_smooth call to
geom_smooth(data=df, aes_string(xvar, yvar))
with the sample data, this returned

Normalising the x scales of overlaying density plots in ggplot

When overlaying ggplot density plots that feature data of same length but different scales is it possible to normalise the x scale for the plots so the densities match up? Alternatively is there a way to normalise the density y scale?
library(ggplot2)
data <- data.frame(x = c('A','B','C','D','E'), y1 = rnorm(100, mean = 0, sd = 1),
y2 = rnorm(100, mean = 0, sd = 50))
p <- ggplot(data)
# Overlaying the density plots is a fail
p + geom_density(aes(x=y1), fill=NA) + geom_density(aes(x=y2), alpha=0.3,col=NA,fill='red')
# You can compress the xscale in the aes() argument:
y1max <- max(data$y1)
y2max <- max(data$y2)
p + geom_density(aes(x=y1), fill=NA) + geom_density(aes(x=y2*y1max/y2max), alpha=0.3,col=NA,fill='red')
# But it doesn't fix the density scale. Any solution?
# And will it work with facet_wrap?
p + geom_density(aes(x=y1), col=NA,fill='grey30') + facet_wrap(~ x, ncol=2)
Thanks!
Does this do what you were hoping for?
p + geom_density(aes(x=scale(y1)), fill=NA) +
geom_density(aes(x=scale(y2)), alpha=0.3,col=NA,fill='red')
The scale function with only a single data argument will center an empiric distribution on 0 and then divide the resulting values by the sample standard deviation so the result has a standard deviation of 1. You can change the defaults for the location and the degree of "compression" or "expansion". You will probably need to investigate putting in appropriate x_scales for y1 and y2. This may take some preprocessing with scale. The scaling factor is recorded in an attribute of the returned object.
attr(scale(data$y2), "scaled:scale")
#[1] 53.21863

Resources