Find all local maxima of a geom_smooth curve in R ggplot?

I need to find all local maxima of a geom_smooth() curve in R. This has been asked on Stack Overflow before:
How can I get the peak and valleys of a geom_smooth line in ggplot2?
But that answer relates to finding a single maximum. What if there are multiple local maxima we want to find?
Here's some sample data:
library(tidyverse)
set.seed(404)
df <- data.frame(x = seq(0, 4*pi, length.out = 1000),
                 y = sin(seq(0, 4*pi, length.out = 1000)) + rnorm(1000, 0, 1))
df %>% ggplot(aes(x=x, y=y)) +
  geom_point() +
  geom_smooth()
To find a single maximum, we use the function underlying geom_smooth() to get the y values of the curve. geom_smooth() uses gam() for 1000 or more data points and loess() for fewer; in this case it's gam() from library(mgcv). Finding our maximum is then a simple matter of subsetting with which.max(). We can plot the modeled y values over geom_smooth() to confirm they're the same, with our maximum marked by a vertical line:
library(mgcv)
df <- df %>%
  mutate(smooth_y = predict(gam(y ~ s(x, bs="cs"), data=df)))
maximum <- df$x[which.max(df$smooth_y)]
df %>% ggplot() +
  geom_point(aes(x=x, y=y)) +
  geom_smooth(aes(x=x, y=y)) +
  geom_line(aes(x=x, y=smooth_y), size = 1.5, linetype = 2, col = "red") +
  geom_vline(xintercept = maximum, color="green")
So far, so good. But there is more than one maximum here. Maybe we're trying to find the periodicity of the sine wave, measured as the average distance between maxima. How do we make sure we find all maxima in the series?
I am posting my answer below, but I am wondering if there's a more elegant solution than the brute-force method I used.

You can find the points where the difference between subsequent points flips sign using run-length encoding. Note that this method is approximate and relies on x being ordered. You can refine the locations by predicting more closely spaced x-values.
library(tidyverse)
library(mgcv)
set.seed(404)
df <- data.frame(x = seq(0, 4*pi, length.out = 1000),
                 y = sin(seq(0, 4*pi, length.out = 1000)) + rnorm(1000, 0, 1))
df <- df %>%
  mutate(smooth_y = predict(gam(y ~ s(x, bs="cs"), data=df)))
# Run-length encode the sign of the differences between successive points
runs <- rle(diff(as.vector(df$smooth_y)) > 0)
# Calculate the start index of each run
starts <- cumsum(runs$lengths) - runs$lengths + 1
# Take the points where a FALSE run begins
# (the difference flips from positive to negative, i.e. a local maximum)
maxima_id <- starts[!runs$values]
# Also convenient, but not in the question:
# minima_id <- starts[runs$values]
maximum <- df$x[maxima_id]
df %>% ggplot() +
  geom_point(aes(x=x, y=y)) +
  geom_smooth(aes(x=x, y=y)) +
  geom_line(aes(x=x, y=smooth_y), size = 1.5, linetype = 2, col = "red") +
  geom_vline(xintercept = maximum, color="green")
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Created on 2020-12-24 by the reprex package (v0.3.0)
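To refine the peak locations as mentioned above, you can predict on a denser grid of x values and re-run the sign-change detection. A minimal sketch, assuming the same gam fit as in the code above (the grid size of 10000 is an arbitrary choice):
fit <- gam(y ~ s(x, bs="cs"), data=df)
fine <- data.frame(x = seq(min(df$x), max(df$x), length.out = 10000))
fine$smooth_y <- as.vector(predict(fit, newdata = fine))
runs <- rle(diff(fine$smooth_y) > 0)
starts <- cumsum(runs$lengths) - runs$lengths + 1
refined_maxima <- fine$x[starts[!runs$values]]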

I went with a brute-force Monte Carlo method to solve the problem. Using replicate(), we try out 100 random ranges of x and find the maximum y value within each range. We reject maxima that occur at either end of the range. Then we find all unique values of the output vector:
maxima <- replicate(100, {
  x_range <- sample(df$x, size=2, replace=FALSE) %>% sort()
  max_loc <- df %>%
    filter(x >= x_range[1] & x <= x_range[2]) %>%
    filter(smooth_y == max(smooth_y)) %>%
    pull(x)
  # Reject maxima that fall on either end of the sampled range
  if (max_loc == min(x_range) | max_loc == max(x_range)) NA else max_loc
})
unique_maxima <- unique(maxima[!is.na(maxima)])
df %>% ggplot() +
  geom_point(aes(x=x, y=y)) +
  geom_smooth(aes(x=x, y=y)) +
  geom_line(aes(x=x, y=smooth_y), size = 1.5, linetype = 2, col = "red") +
  geom_vline(xintercept = unique_maxima, color="green")
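Since the stated goal was the periodicity of the sine wave, the average spacing between successive maxima follows directly from either method; for this data it should come out near 2*pi if only the true peaks are found:
periodicity <- mean(diff(sort(unique_maxima)))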

Related

Is it possible to recreate the functionality of bayesplot's "mcmc_areas" plot in ggplot in R

There is a package supported by Stan called bayesplot that can produce nice density area plots, with the area under the density curves partitioned based on credible intervals over the posterior parameter samples drawn through MCMC. This results in a plot that looks like the following:
I am looking to make a similar style of plot in ggplot, given generic 1D lists of sampled values, so that I can pass it any list of values without it having to be a Stan fit etc. Does anyone know how to do this? The density part is clear via geom_density, but I am struggling with the fill partitioning.
Here's a function that generates a plot similar to bayesplot::mcmc_areas. It plots credible intervals (equal-tailed by default, or highest density) with optional setting of the probability width of the interval:
library(tidyverse)
library(ggridges)
library(bayestestR)
theme_set(theme_classic(base_size=15))
# Create ridgeplots with credible intervals
# ARGUMENTS
# data A data frame
# FUN A function that calculates credible intervals
# ci The width of the credible interval
# ... For passing optional arguments to geom_ridgeline.
# For example, change the scale parameter to control overlap of ridge lines.
# geom_ridgeline's default is scale=1.
plot_density_ridge = function(data, FUN=c("eti", "hdi"), ci=0.89, ...) {
  # Determine whether to use the eti or hdi function
  FUN = match.arg(FUN)
  FUN = match.fun(FUN)
  # Get kernel density estimate as a data frame
  dens = map_df(data, ~ {
    d = density(.x, na.rm=TRUE)
    tibble(x=d$x, y=d$y)
  }, .id="name")
  # Set relative width of median line
  e = diff(range(dens$x)) * 0.006
  # Get credible interval width and median
  cred.int = data %>%
    pivot_longer(cols=everything()) %>%
    group_by(name) %>%
    summarise(CI=list(FUN(value, ci=ci)),
              m=median(value, na.rm=TRUE)) %>%
    unnest_wider(CI)
  dens %>%
    left_join(cred.int) %>%
    ggplot(aes(y=name, x=x, height=y)) +
    geom_vline(xintercept=0, colour="grey70") +
    geom_ridgeline(data = . %>% group_by(name) %>%
                     filter(between(x, CI_low, CI_high)),
                   fill=hcl(230,25,85), ...) +
    geom_ridgeline(data = . %>% group_by(name) %>%
                     filter(between(x, m - e, m + e)),
                   fill=hcl(240,30,60), ...) +
    geom_ridgeline(fill=NA, ...) +
    geom_ridgeline(fill=NA, aes(height=0), ...) +
    labs(y=NULL, x=NULL)
}
Now let's try out the function:
# Fake data
set.seed(2)
d = data.frame(a = rnorm(1000, 0.6, 1),
               b = rnorm(1000, 1.3, 0.5),
               c = rnorm(1000, -1.2, 0.7))
plot_density_ridge(d)
plot_density_ridge(d, ci=0.5, scale=1.5)
plot_density_ridge(iris %>% select(-Species))
plot_density_ridge(iris %>% select(-Species), FUN="hdi")
Use the ggridges package:
library(tidyverse)
library(ggridges)
tibble(data_1, data_2, data_3) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, y = name, group = name)) +
  geom_density_ridges()
Data:
set.seed(123)
n <- 15
data_1 <- rnorm(n)
data_2 <- data_1 - 1
data_3 <- data_1 + 2
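If the interval-based fill partitioning is the part you're missing, ggridges can also approximate it directly via its ECDF-based quantile cutting. A sketch, assuming a recent ggplot2/ggridges; the cut points c(0.055, 0.945) (an 89% interval, echoing the default in the first answer) are an arbitrary choice:
tibble(data_1, data_2, data_3) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, y = name, fill = factor(after_stat(quantile)))) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE,
                      quantiles = c(0.055, 0.945)) +
  scale_fill_manual(values = c("skyblue", "grey80", "skyblue"),
                    name = "Interval")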

How to delete outliers from a QQ-plot graph made with ggplot()?

I have a two dimensional dataset (say columns x and y). I use the following function to plot a QQ-plot of this data.
# Creating toy data for presentation
df = cbind(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
# Plotting the QQ-plot
library(ggplot2)
df_qq = as.data.frame(qqplot(df[,1], df[,2], plot.it=FALSE))
ggplot(df_qq) +
  geom_point(aes(x=x, y=y), size = 2) +
  geom_abline(intercept = 0, slope = 1)
That is the resulting graph:
My question is: how do I avoid plotting the last point, (12, 32)? I would rather not delete it manually because I have several of these data pairs, each with similar outliers. What I would like to do is write code that identifies the points that are too far from the 45-degree line and eliminates them from df_qq (for instance, a point could be dropped if it is 5 times further from the line than the average distance). My main objective is to make the graph easier to read: when outliers are not eliminated, the regular part of the QQ-plot occupies too small a part of the graph, which prevents me from visually evaluating how similar the two vectors are apart from the outliers.
I would appreciate any help.
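For reference, the "5 times the average distance" rule described above can be coded directly. A minimal sketch (the threshold is the question's own suggestion, not an established cutoff):
d45 <- abs(df_qq$y - df_qq$x) / sqrt(2)  # perpendicular distance to the line y = x
keep <- d45 <= 5 * mean(d45)
ggplot(df_qq[keep, ]) +
  geom_point(aes(x=x, y=y), size = 2) +
  geom_abline(intercept = 0, slope = 1)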
There is a CRAN package, referenceIntervals, that uses Cook's distance to detect outliers. Applying it to the values of df_qq$y gives an index of the rows to be removed from df_qq.
library(referenceIntervals)
out <- cook.outliers(df_qq$y)$outliers
i <- which(df_qq$y %in% out)
ggplot(df_qq[-i, ]) +
  geom_point(aes(x=x, y=y), size = 2) +
  geom_abline(intercept = 0, slope = 1)
Edit: Following the OP's comment that "as far as I understand this function does not look at the relation between x & y", maybe the following function is what is needed. It removes points only if they are outliers in one of the vectors but not in both.
cookOut <- function(X){
  # Flag outliers in each column separately
  out1 <- cook.outliers(X[[1]])$outliers
  out2 <- cook.outliers(X[[2]])$outliers
  i <- X[[1]] %in% out1
  j <- X[[2]] %in% out2
  # Drop rows flagged in exactly one of the two columns
  w <- which((!i & j) | (i & !j))
  if(length(w)) X[-w, ] else X
}
Test with the second data set, the one in the comment. The extra vector id is just to make faceting easier.
df1 <- data.frame(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
df2 <- data.frame(x = c(1,5,8,2,9,6,1,7,32), y = c(1,4,10,1,6,5,2,1,32))
df_qq1 = as.data.frame(qqplot(df1[,1], df1[,2], plot.it=FALSE))
df_qq2 = as.data.frame(qqplot(df2[,1], df2[,2], plot.it=FALSE))
df_qq_out1 <- cookOut(df_qq1)
df_qq_out2 <- cookOut(df_qq2)
df_qq_out1$id <- "A"
df_qq_out2$id <- "B"
df_qq_out <- rbind(df_qq_out1, df_qq_out2)
ggplot(df_qq_out) +
  geom_point(aes(x=x, y=y), size = 2) +
  geom_abline(intercept = 0, slope = 1) +
  facet_wrap(~ id)

Converting data to percentage rank

I have data whose mean and variance changes as a function of the independent variable. How do I convert the dependent variable into (estimated) conditional percentage ranks?
For example, say the data looks like Z below:
library(dplyr)
library(ggplot2)
data.frame(x = runif(1000, 0, 5)) %>%
  mutate(y = sin(x) + rnorm(n())*cos(x)/3) ->
  Z
we can plot it with Z %>% ggplot(aes(x,y)) + geom_point(): it looks like a dispersed sine function, where the variance around that sine function varies with x. My goal is to convert each y value into a number between 0 and 1 representing its percentage rank among values with similar x. So values very close to that sine function should be converted to about 0.5, while values below it should be converted to values closer to 0 (depending on the variance around that x).
One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket.
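A minimal sketch of that bucketing idea with dplyr (already loaded above); the choice of 20 equal-width buckets is arbitrary:
Z %>%
  mutate(bucket = cut(x, breaks = 20)) %>%
  group_by(bucket) %>%
  mutate(prank = percent_rank(y)) %>%
  ungroup()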
Another way (which I think is preferable) to do what I ask is to perform a quantile regression for a number of different quantiles (tau):
library(quantreg)
library(splines)
model.fit <- rq(y ~ bs(x, df = 5), tau = (1:9)/10, data = Z)
which can be plotted as follows:
library(tidyr)
data.frame(x = seq(0, 5, len = 100)) %>%
  data.frame(., predict(model.fit, newdata = .), check.names = FALSE) %>%
  gather(Tau, y, -x) %>%
  ggplot(aes(x,y)) +
  geom_point(data = Z, size = 0.1) +
  geom_line(aes(color = Tau), size = 1)
Given model.fit I could now use the estimated quantiles for each x value to convert each y value into a percentage rank (with the help of approx(...)), but I suspect that package quantreg may do this more easily and better. Is there, in fact, some function in quantreg which automates this?
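One way to implement the approx(...) idea mentioned above, as a sketch: interpolate each observation's y between its fitted conditional quantiles. It clamps values outside the outermost fitted quantiles and assumes the fitted quantiles do not cross:
taus <- (1:9)/10
fitted.q <- predict(model.fit, newdata = Z)  # one row per observation, one column per tau
Z$prank <- sapply(seq_len(nrow(Z)), function(i) {
  q <- fitted.q[i, ]
  if (Z$y[i] <= min(q)) return(min(taus))
  if (Z$y[i] >= max(q)) return(max(taus))
  approx(x = q, y = taus, xout = Z$y[i])$y
})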

Custom ggplot2 shaded error areas on categorical line plot

I'm plotting a line smoothed by loess, and I'm trying to figure out how to include shaded error areas that are defined by existing variables but are also smoothed.
This code creates example data:
set.seed(12345)
data <- cbind(rep("A", 100), rnorm(100, 0, 1))
data <- rbind(data, cbind(rep("B", 100), rnorm(100, 5, 1)))
data <- rbind(data, cbind(rep("C", 100), rnorm(100, 10, 1)))
data <- rbind(data, cbind(rep("D", 100), rnorm(100, 15, 1)))
data <- cbind(rep(1:100, 4), data)
data <- data.frame(data)
names(data) <- c("num", "category", "value")
data$num <- as.numeric(data$num)
data$value <- as.numeric(data$value)
data$upper <- data$value+0.20
data$lower <- data$value-0.30
Plotting the data below, this is what I get:
library(ggplot2)
ggplot(data, aes(x=num, y=value, colour=category)) +
  stat_smooth(method="loess", se=F)
What I'd like is a plot that looks like the following, except with the upper and lower bounds of the shaded areas being bounded by smoothed lines of the "upper" and "lower" variables in the generated data.
Any help would be greatly appreciated.
Here's one way to add smoothed versions of upper and lower. We'll add LOESS predictions for upper and lower to the data frame and then plot those using geom_ribbon. It would be more elegant if this could all be done within the call to ggplot. That's probably possible by feeding a special-purpose function to stat_summary, and hopefully someone else will post an answer using that approach.
# Expand the scale of the upper and lower values so that the difference
# is visible in the plot
data$upper = data$value + 10
data$lower = data$value - 10
# Order data by category and num
data = data[order(data$category, data$num),]
# Create LOESS predictions for the values of upper and lower
# and add them to the data frame. I'm sure there's a better way to do this,
# but my attempts with dplyr and tapply both failed, so I've resorted to the clunky
# method below.
data$upperLoess = unlist(lapply(LETTERS[1:4],
  function(x) predict(loess(data$upper[data$category==x] ~
                            data$num[data$category==x]))))
data$lowerLoess = unlist(lapply(LETTERS[1:4],
  function(x) predict(loess(data$lower[data$category==x] ~
                            data$num[data$category==x]))))
# Use geom_ribbon to add a prediction band bounded by the LOESS predictions for
# upper and lower
ggplot(data, aes(num, value, colour=category, fill=category)) +
  geom_smooth(method="loess", se=FALSE) +
  geom_ribbon(aes(x=num, y=value, ymax=upperLoess, ymin=lowerLoess),
              alpha=0.2)
And here's the result:
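As an aside, the clunky lapply step above can be written with a grouped mutate in current dplyr, since model functions called inside mutate() can see the grouped columns. A sketch:
library(dplyr)
data <- data %>%
  group_by(category) %>%
  mutate(upperLoess = predict(loess(upper ~ num)),
         lowerLoess = predict(loess(lower ~ num))) %>%
  ungroup()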

How to plot a contour line showing where 95% of values fall within, in R and in ggplot2

Say we have:
x <- rnorm(1000)
y <- rnorm(1000)
How do I use ggplot2 to produce a plot containing the following two geoms:
1. The bivariate expectation of the two series of values
2. A contour line showing the region within which 95% of the estimates fall?
I know how to do the first part:
df <- data.frame(x=x, y=y)
p <- ggplot(df, aes(x=x, y=y))
p <- p + xlim(-10, 10) + ylim(-10, 10) # say
p <- p + geom_point(x=mean(x), y=mean(y))
And I also know about the stat_contour() and stat_density2d() functions within ggplot2.
And I also know that there are 'bins' options within stat_contour.
However, I guess what I need is something like the probs argument within quantile, but over two dimensions rather than one.
I have also seen a solution within the graphics package. However, I would like to do this within ggplot.
Help much appreciated,
Jon
Unfortunately, the accepted answer currently fails with Error: Unknown parameters: breaks on ggplot2 2.1.0. I cobbled together an alternative approach based on the code in this answer, which uses the ks package for computing the kernel density estimate:
library(ggplot2)
set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
kd <- ks::kde(d, compute.cont=TRUE)
contour_95 <- with(kd, contourLines(x=eval.points[[1]], y=eval.points[[2]],
                                    z=estimate, levels=cont["5%"])[[1]])
contour_95 <- data.frame(contour_95)
ggplot(data=d, aes(x, y)) +
  geom_point() +
  geom_path(aes(x, y), data=contour_95) +
  theme_bw()
Here's the result:
TIP: The ks package depends on the rgl package, which can be a pain to compile manually. Even if you're on Linux, it's much easier to get a precompiled version, e.g. sudo apt install r-cran-rgl on Ubuntu if you have the appropriate CRAN repositories set up.
Riffing off of Ben Bolker's answer, here is a solution that can handle multiple levels and works with ggplot2 2.2.1:
library(ggplot2)
library(MASS)
library(reshape2)
# create data:
set.seed(8675309)
Sigma <- matrix(c(0.1,0.3,0.3,4),2,2)
mv <- data.frame(mvrnorm(4000,c(1.5,16),Sigma))
# get the kde2d information:
mv.kde <- kde2d(mv[,1], mv[,2], n = 400)
dx <- diff(mv.kde$x[1:2]) # lifted from emdbook::HPDregionplot()
dy <- diff(mv.kde$y[1:2])
sz <- sort(mv.kde$z)
c1 <- cumsum(sz) * dx * dy
# specify desired contour levels:
prob <- c(0.95,0.90,0.5)
# plot:
dimnames(mv.kde$z) <- list(mv.kde$x,mv.kde$y)
dc <- melt(mv.kde$z)
dc$prob <- approx(sz,1-c1,dc$value)$y
p <- ggplot(dc, aes(x=Var1, y=Var2)) +
  geom_contour(aes(z=prob, color=..level..), breaks=prob) +
  geom_point(aes(x=X1, y=X2), data=mv, alpha=0.1, size=1)
print(p)
The result:
This works, but is quite inefficient because you actually have to compute the kernel density estimate three times.
set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
getLevel <- function(x, y, prob=0.95) {
  kk <- MASS::kde2d(x, y)
  dx <- diff(kk$x[1:2])
  dy <- diff(kk$y[1:2])
  sz <- sort(kk$z)
  c1 <- cumsum(sz) * dx * dy
  approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(d$x,d$y)
library(ggplot2); theme_set(theme_bw())
ggplot(d, aes(x, y)) +
  stat_density2d(geom="tile", aes(fill = ..density..),
                 contour = FALSE) +
  stat_density2d(colour="red", breaks=L95)
(with help from http://comments.gmane.org/gmane.comp.lang.r.ggplot2/303)
update: with a recent version of ggplot2 (2.1.0) it doesn't seem possible to pass breaks to stat_density2d (or at least I don't know how), but the method below with geom_contour still seems to work ...
You can make things a little more efficient by computing the kernel density estimate once and plotting the tiles and contours from the same grid:
kk <- with(d, MASS::kde2d(x, y))
library(reshape2)
dimnames(kk$z) <- list(kk$x,kk$y)
dc <- melt(kk$z)
ggplot(dc, aes(x=Var1, y=Var2)) +
  geom_tile(aes(fill=value)) +
  geom_contour(aes(z=value), breaks=L95, colour="red")
doing the 95% level computation from the kk grid (to reduce the number of kernel computations to 1) is left as an exercise
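For completeness, a sketch of that exercise, reusing the kk grid already computed above:
dx <- diff(kk$x[1:2])
dy <- diff(kk$y[1:2])
sz <- sort(kk$z)
c1 <- cumsum(sz) * dx * dy
L95 <- approx(c1, sz, xout = 1 - 0.95)$y  # same level as getLevel(), but only one kde2d call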
I'm not sure why stat_density2d(geom="tile") and geom_tile give slightly different results (the former is smoothed).
I haven't added the bivariate mean, but something like annotate("point",x=mean(d$x),y=mean(d$y),colour="red") should work.
I had an example where the MASS::kde2d() bandwidth specifications were not flexible enough, so I ended up using the ks package instead: the ks::kde() function with, for example, ks::Hscv() to estimate flexible bandwidths that captured the smoothness better. The computation can be a bit slow, but it performs much better in some situations. Here is a version of the above code for that example:
set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
getLevel <- function(x, y, prob=0.95) {
  kk <- MASS::kde2d(x, y)
  dx <- diff(kk$x[1:2])
  dy <- diff(kk$y[1:2])
  sz <- sort(kk$z)
  c1 <- cumsum(sz) * dx * dy
  approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(d$x,d$y)
library(ggplot2); theme_set(theme_bw())
ggplot(d, aes(x, y)) +
  stat_density2d(geom="tile", aes(fill = ..density..),
                 contour = FALSE) +
  stat_density2d(colour="red", breaks=L95)
## using ks::kde
hscv1 <- ks::Hscv(d)
fhat <- ks::kde(d, H=hscv1, compute.cont=TRUE)
dimnames(fhat[['estimate']]) <- list(fhat[["eval.points"]][[1]],
                                     fhat[["eval.points"]][[2]])
library(reshape2)
aa <- melt(fhat[['estimate']])
ggplot(aa, aes(x=Var1, y=Var2)) +
  geom_tile(aes(fill=value)) +
  geom_contour(aes(z=value), breaks=fhat[["cont"]]["50%"], color="red") +
  geom_contour(aes(z=value), breaks=fhat[["cont"]]["5%"], color="purple")
For this particular example the differences are minimal, but in an example where the bandwidth specification requires more flexibility, this modification may be important. Note that the 95% contour is specified using breaks=fhat[["cont"]]["5%"], which I found a little counter-intuitive: ks names it the "5%" contour because it is the density level above which 95% of the probability mass lies.
Just mixing the answers from above and putting them in a more tidyverse-friendly way, while allowing for multiple contour levels. Here I use geom_path(group = prob) and add the labels manually with geom_text. Another approach is geom_path(colour = prob), which will automatically label the contours in a legend.
library(ks)
library(tidyverse)
set.seed(1001)
## data
d <- MASS::mvrnorm(1000, c(0, 0.2), matrix(c(1, 0.4, 0.4, 1), ncol=2)) %>%
  magrittr::set_colnames(c("x", "y")) %>%
  as_tibble()
## density function
kd <- ks::kde(d, compute.cont=TRUE, h=0.2)
## extract results
get_contour <- function(kd_out=kd, prob="5%") {
  contour_out <- with(kd_out, contourLines(x=eval.points[[1]], y=eval.points[[2]],
                                           z=estimate, levels=cont[prob])[[1]])
  as_tibble(contour_out) %>%
    mutate(prob = prob)
}
dat_out <- map_dfr(c("10%", "20%", "80%", "90%"), ~get_contour(kd, .)) %>%
  group_by(prob) %>%
  mutate(n_val = 1:n()) %>%
  ungroup()
## clean kde output
kd_df <- expand_grid(x=kd$eval.points[[1]], y=kd$eval.points[[2]]) %>%
  mutate(z = c(kd$estimate %>% t))
ggplot(data=kd_df, aes(x, y)) +
  geom_tile(aes(fill=z)) +
  geom_point(data = d, alpha = I(0.4), size = I(0.4), colour = I("yellow")) +
  geom_path(aes(x, y, group = prob),
            data = filter(dat_out, !n_val %in% 1:3), colour = I("white")) +
  geom_text(aes(label = prob),
            data = filter(dat_out,
                          (prob %in% c("10%", "20%", "80%") & n_val == 1) |
                            (prob %in% c("90%") & n_val == 20)),
            colour = I("black"), size = I(3)) +
  scale_fill_viridis_c() +
  theme_bw() +
  theme(legend.position = "none")
Created on 2019-06-25 by the reprex package (v0.3.0)
