Remove data to the left and right of local minima - r

I have a lot of measurements where I get data that looks something like this:
# Generate example data
x <- 1:100
y <- 100*(1-exp(-0.3*x))
x2 <- 101:200
y2 <- rev(y)
df <- data.frame("x" = c(x, x2),
"y" = c(y, y2))
df$x <- df$x + 50
rm(x, x2, y, y2)
x <- 1:50
y <- 25.91818
x2 <- 251:300
y2 <- 25.91818
df2 <- data.frame("x" = c(x, x2),
"y" = c(y, y2))
rm(x, x2, y, y2)
df <- rbind(df, df2)
rm(df2)
If I plot this I can see that there are left-most and right-most local minima.
library(ggplot2)
p <- ggplot(df, aes(x,y))+
geom_line()+
geom_point(data = data.frame("x" = c(50, 250), "y" = c(25.91818, 25.91818)),
mapping = aes(x, y), colour = "red")+
scale_y_continuous(limits = c(0, 101))
p + annotate("text", label = "minimum 1", x = 50, y = 20) +
annotate("text", label = "minimum 2", x = 250, y = 20)
What I would like to do is trim those data that are to the left of minimum 1 and right of minimum 2. It's not super straightforward as there may also be local minima between those two points, because the real data doesn't look this ideal. I would also need to apply this process to many many samples, but I think this may be trivial because I could use e.g. dplyr and group_by().
I had some luck plotting the local minima using the ggpmisc package, but I'm not sure how I can use that to actually subset my data. Just for clarity I included the code to do so below, and with the real data it looks a little better:
library(ggpmisc)
p2 <- ggplot(df, aes(x, y))+
geom_line()+
ggpmisc::stat_peaks(col="red", span=3)
p2
I hope this is clear and I'm happy to clarify any questions. Thank you in advance.

You could do this using the following steps:
1. Sort your data according to its x co-ordinates
2. On your sorted data, find the diff of the y co-ordinates, which will be 0 (or close to 0) for the flat sections at either end (as well as any flat sections in between)
3. Starting from the left, find the first point where the diff is not zero (or at least is above a minimal threshold). Store this index as a variable called left
4. Starting from the right, find the first point where the diff is not zero (or at least is above a minimal threshold). Store this index as a variable called right
5. Subset your data frame so it only contains the data between rows left:right
So, in your example we would have:
# Define a minimal threshold above which we are not at the minimum line
minimal_change <- 1e-6
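df <- df[order(df$x),] # Step 1: sort by x
left <- which(diff(df$y) > minimal_change)[1] # Steps 2-3: first rise, scanning from the left
right <- nrow(df) - which(diff(rev(df$y)) > minimal_change)[1] + 1 # Step 4: first rise, scanning from the right
df <- df[left:right, ] # Step 5: keep only rows left:right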
Now we can plot the result:
ggplot(df, aes(x, y)) +
geom_line()+
geom_point(data = data.frame("x" = c(50, 250), "y" = c(25.91818, 25.91818)),
mapping = aes(x, y), colour = "red") +
scale_y_continuous(limits = c(0, 101)) +
scale_x_continuous(limits = c(0, 300))
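Since the question mentions many samples, the same steps can be wrapped in a function and applied to each sample in turn. The following is only a minimal sketch in base R: the helper names trim_flat_ends and make_sample, the grouping column sample, and the toy data are made up for illustration and are not part of the original data.

```r
# Hypothetical helper: trim the flat runs at both ends of one sample
trim_flat_ends <- function(d, minimal_change = 1e-6) {
  d <- d[order(d$x), ]
  rising  <- which(diff(d$y) > minimal_change)
  falling <- which(diff(rev(d$y)) > minimal_change)
  # If the sample never leaves the baseline there is nothing to trim
  if (length(rising) == 0 || length(falling) == 0) return(d)
  left  <- rising[1]
  right <- nrow(d) - falling[1] + 1
  d[left:right, ]
}

# Toy data (assumed): flat, rise, fall, flat, with a different flat length per sample
make_sample <- function(id, n_flat) {
  y <- c(rep(1, n_flat), 2:5, 4:2, rep(1, n_flat))
  data.frame(sample = id, x = seq_along(y), y = y)
}
samples <- rbind(make_sample("a", 3), make_sample("b", 5))

# Apply the trimming to each sample separately and recombine
trimmed <- do.call(rbind, lapply(split(samples, samples$sample), trim_flat_ends))
```

With dplyr, the same function could probably be passed to a grouped operation after group_by(); the base R split/lapply version above just avoids the extra dependency.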

Related

Use a gradient fill under a facet wrap of density curves in ggplot in R?

Similar questions have been asked before in other forms. Some can be found here and here. However, I can't seem to adapt them when using a facet wrap displaying multiple density plots.
I tried adapting the other examples, but failed... I also tried using the ggpattern package, but when there is a large amount of data, it takes several minutes on my machine to create a plot.
I am trying to create a gradient under the density curve... but with the gradient pointing down. Something like in the example image below:
Some example data to work with:
library(ggplot2)
set.seed(321)
# create data
varNames <- c("x1", "x2", "x3")
df <- data.frame(
var = sample(varNames, 100, replace = TRUE),
val = runif(100)
)
# create plot
ggplot(df, aes(x = val)) +
geom_density(aes(colour = var, fill = var)) +
facet_wrap(~var) +
theme_bw() +
theme(legend.position = "none")
You can use teunbrand's function, but you will need to apply it to each facet separately. Here I simply loop over the groups with lapply:
library(tidyverse)
library(polyclip)
#> polyclip 1.10-0 built from Clipper C++ version 6.4.0
## This is teunbrand's function, copied without any change!!
## from https://stackoverflow.com/a/64695516/7941188
fade_polygon <- function(x, y, n = 100) {
poly <- data.frame(x = x, y = y)
# Create bounding-box edges
yseq <- seq(min(poly$y), max(poly$y), length.out = n)
xlim <- range(poly$x) + c(-1, 1)
# Pair y-edges
grad <- cbind(head(yseq, -1), tail(yseq, -1))
# Add vertical ID
grad <- cbind(grad, seq_len(nrow(grad)))
# Slice up the polygon
grad <- apply(grad, 1, function(range) {
# Create bounding box
bbox <- data.frame(x = c(xlim, rev(xlim)),
y = c(range[1], range[1:2], range[2]))
# Do actual slicing
slice <- polyclip::polyclip(poly, bbox)
# Format as data.frame
for (i in seq_along(slice)) {
slice[[i]] <- data.frame(
x = slice[[i]]$x,
y = slice[[i]]$y,
value = range[3],
id = c(1, rep(0, length(slice[[i]]$x) - 1))
)
}
slice <- do.call(rbind, slice)
})
# Combine slices
grad <- do.call(rbind, grad)
# Create IDs
grad$id <- cumsum(grad$id)
return(grad)
}
## now here starts the change, loop over your variables. I'm creating the data frame directly instead of keeping the density object
dens <- lapply(split(df, df$var), function(x) {
dens <- density(x$val)
data.frame(x = dens$x, y = dens$y)
}
)
## we need this one for the plot, but still need the list
dens_df <- bind_rows(dens, .id = "var")
grad <- bind_rows(lapply(dens, function(x) fade_polygon(x$x, x$y)), .id = "var")
ggplot(grad, aes(x, y)) +
geom_line(data = dens_df) +
geom_polygon(aes(alpha = value, group = id),
fill = "blue") +
facet_wrap(~var) +
scale_alpha_continuous(range = c(0, 1))
Created on 2021-12-05 by the reprex package (v2.0.1)

NA Sawtooth signal

How would it be possible to represent (both as a plot and numerically) a sawtooth signal in R from:
y <- c(NA,NA,NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA)
where a 1 in y marks the time points at which the sawtooth reaches a peak (a value of 1). Note that the distances between peaks are unequal.
I thought about using interpolation but maybe it is unnecessary.
Thank you,
You can create a sequence of falling numbers like this:
peaks <- c(0, which(!is.na(y)), length(y))
drop <- -1/max(diff(peaks))
df <- do.call(rbind, lapply(diff(peaks), function(x) {
data.frame(x = c(0, rep(1, x)),
y = c(1, seq(1 + drop, by = drop, length.out = x)))
}))
df$x <- cumsum(df$x)
Which gives this result:
plot(df$x, df$y, type = "l")
Or if you want to be fancy...
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_line(col = "deepskyblue4", size = 1.5) +
theme_bw()
Created on 2020-09-18 by the reprex package (v0.3.0)
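The data frame above contains two rows at every peak (the bottom and the top of the jump), which is convenient for plotting. For the numerical part of the question, a vector with one value per element of y may be easier to work with. This is only a sketch under the same assumptions as the answer (the signal starts at a peak at index 0 and falls with the same slope drop); the names starts, idx, last_peak and saw are introduced here for illustration.

```r
y <- c(NA,NA,NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA)

# Same slope as in the answer: the longest gap falls from 1 towards 0
peaks <- c(0, which(!is.na(y)), length(y))
drop <- -1 / max(diff(peaks))

# Index of the most recent peak at or before each position (0 counts as a peak)
starts <- c(0, which(!is.na(y)))
idx <- seq_along(y)
last_peak <- starts[findInterval(idx, starts)]

# Value of the falling sawtooth at each original index
saw <- 1 + drop * (idx - last_peak)
```

Here saw has the same length as y and equals 1 wherever y is not NA.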

Make ggplot with regression line and normal distribution overlay

I am trying to make a plot to show the intuition behind logistic (or probit) regression. How would I make a plot that looks something like this in ggplot?
(Wolf & Best, The Sage Handbook of Regression Analysis and Causal Inference, 2015, p. 155)
Actually, what I would rather even do is have one single normal distribution displayed along the y axis with mean = 0, and a specific variance, so that I can draw horizontal lines going from the linear predictor to the y axis and sideways normal distribution. Something like this:
What this is supposed to show (assuming I haven't misunderstood something) is . I haven't had much success so far...
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
# Probability density function of the standard logistic distribution
pdfDeltaFun <- function(x) {
prob = (exp(x)/(1 + exp(x))^2)
return(prob)
}
# Tried switching the x and y to be able to turn the
# distribution overlay 90 degrees with coord_flip()
ggplot(df, aes(x = y, y = x)) +
geom_point() +
geom_line() +
stat_function(fun = pdfDeltaFun)+
coord_flip()
I think this comes pretty close to the first illustration you give. If this is something you don't need to repeat many times, it is probably best to compute the density curves prior to plotting and use a separate data frame to plot them.
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
# For every row in `df`, compute a rotated normal density centered at `y` and shifted by `x`
curves <- lapply(seq_len(NROW(df)), function(i) {
mu <- df$y[i]
range <- mu + c(-3, 3)
seq <- seq(range[1], range[2], length.out = 100)
data.frame(
x = -1 * dnorm(seq, mean = mu) + df$x[i],
y = seq,
grp = i
)
})
# Combine above densities in one data.frame
curves <- do.call(rbind, curves)
ggplot(df, aes(x, y)) +
geom_point() +
geom_line() +
# The path draws the curve
geom_path(data = curves, aes(group = grp)) +
# The polygon does the shading. We can use `oob_squish()` to set a range.
geom_polygon(data = curves, aes(y = scales::oob_squish(y, c(0, Inf)),group = grp))
The second illustration is pretty close to your code. I replaced your density function with the standard normal density function and added some extra parameters to stat_function:
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
ggplot(df, aes(x, y)) +
geom_point() +
geom_line() +
stat_function(fun = dnorm,
aes(x = after_stat(-y * 4 - 5), y = after_stat(x)),
xlim = range(df$y)) +
# We fill with a polygon, squishing the y-range
stat_function(fun = dnorm, geom = "polygon",
aes(x = after_stat(-y * 4 - 5),
y = after_stat(scales::oob_squish(x, c(-Inf, -1)))),
xlim = range(df$y))

Plot one data frame column against all other columns using ggplots and showing densities in R

I have a data frame with 20 columns, and I want to plot one specific column (called BB) against each single column in the data frame. The plots I need are probability density plots, and I’m using the following code to generate one plot (plotting columns BB vs. AA as an example):
library(data.table) # fread
library(ggplot2)
library(viridis) # scale_color_viridis
mydata = as.data.frame(fread("filename.txt")) #read my data as data frame
#function to calculate density
get_density <- function(x, y, n = 100) {
dens <- MASS::kde2d(x = x, y = y, n = n)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
#define the x and y of the plot; x = column called AA; y = column called BB
xy1 <- data.frame(
x = mydata$AA,
y = mydata$BB
)
#call function get_density to calculate density for the defined x and y
xy1$density <- get_density(xy1$x, xy1$y)
#Plot
ggplot(xy1) + geom_point(aes(x, y, color = density), size = 3, pch = 20) + scale_color_viridis() +
labs(title = "BB vs. AA") +
scale_x_continuous(name="AA") +
scale_y_continuous(name="BB")
I would appreciate it if someone could suggest a method to produce multiple plots of BB against every other column, using the above density function and ggplot command. I tried adding a loop, but found it too complicated, especially when defining the x and y to be plotted or calling the density function.
Since you don't provide sample data, I'll demo on mtcars. We convert the data to long format, calculate the densities, and make a faceted plot. We plot the mpg column against all others.
library(dplyr)
library(tidyr)
mtlong = gather(mtcars, key = "var", value = "value", -mpg) %>%
group_by(var) %>%
mutate(density = get_density(value, mpg))
ggplot(mtlong, aes(x = value, y = mpg, color = density)) +
geom_point(pch = 20, size = 3) +
labs(x = "") +
facet_wrap(~ var, scales = "free")
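If you prefer the loop you originally had in mind, the same long data frame can be built in base R by looping over the column names. This is just a sketch on mtcars: get_density is copied from the question, while the names others and long are made up here.

```r
# Density helper copied from the question
get_density <- function(x, y, n = 100) {
  dens <- MASS::kde2d(x = x, y = y, n = n)
  ix <- findInterval(x, dens$x)
  iy <- findInterval(y, dens$y)
  ii <- cbind(ix, iy)
  return(dens$z[ii])
}

# Every column except the one we plot against
others <- setdiff(names(mtcars), "mpg")

# One block of rows per column, stacked into a single long data frame
long <- do.call(rbind, lapply(others, function(v) {
  data.frame(var = v,
             value = mtcars[[v]],
             mpg = mtcars$mpg,
             density = get_density(mtcars[[v]], mtcars$mpg))
}))
```

The resulting long has the same columns as mtlong above (var, value, mpg, density), so it should work with the same faceted ggplot call.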

geom_errorbar with ecdf in ggplot

I want to create an ecdf plot with two lines and I would like to add errorbars to one of them.
I am using this code
x <- c(16,16,16,16,34,35,38,42,45,1,12)
xError <- c(0,1,1,1,3,3,3,4,5,1,1)
y <- c(16,1,12)
length(x)
length(xError)
length(y)
df <- rbind(data.frame(value = x,name='x'),
data.frame(value = y,name='y'))
ggplot(df, aes(x=value,color=name,linetype=name))+ stat_ecdf()+ geom_errorbar(aes(ymax = x + xError, ymin=x - xError))
The error bars should be added to the x values, but it gives me this error:
Error: Aesthetics must either be length one, or the same length as the dataProblems: x + xError, x - xError
I don't get it - the result is of the same length.
EDIT
I changed the problem so that it is easier. I think the real problem is related to ECDF plots and error bars. Take this code as an example:
x <- c(16,16,16,16,34,35,38,42,45,1,12)
xError <- c(0,1,1,1,3,3,3,4,5,1,1)
y <- c(16,1,12)
df <- data.frame(value = x)
ggplot(df, aes(x=value))+ stat_ecdf()+ geom_errorbar(aes(ymax = x + xError, ymin=x - xError))
It prints the error bars, but the plot is completely broken.
There is a similar question here: confidence interval for ecdf
Maybe that is what you would like to achieve.
EDIT:
I think this is what you are trying to get:
dat2 <- data.frame(variable = x)
dat2 <- transform(dat2, lower = x - xError, upper = x + xError)
l <- ecdf(dat2$lower)
u <- ecdf(dat2$upper)
v <- ecdf(dat2$variable)
dat2$lower1 <- l(dat2$variable)
dat2$upper1 <- u(dat2$variable)
dat2$variable1 <- v(dat2$variable)
ggplot(dat2,aes(x = variable)) +
geom_step(aes(y = variable1)) +
geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)
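As far as I can tell, the original attempt breaks because the y axis of an ECDF plot runs from 0 to 1, while x + xError and x - xError are in the units of the data. The ribbon above instead uses the ECDFs of the shifted values. Shifting every value up can only lower the ECDF at a given point, so the ECDF of the upper values is the lower bound of the band and the ECDF of the lower values is the upper bound. Here is a small sketch to check this numerically; the names grid and bounds are made up for illustration.

```r
x <- c(16,16,16,16,34,35,38,42,45,1,12)
xError <- c(0,1,1,1,3,3,3,4,5,1,1)

# Evaluate all three ECDFs at the distinct observed values
grid <- sort(unique(x))
bounds <- data.frame(
  value = grid,
  ecdf  = ecdf(x)(grid),
  ymin  = ecdf(x + xError)(grid), # values shifted up -> ECDF is lower
  ymax  = ecdf(x - xError)(grid)  # values shifted down -> ECDF is higher
)
```

In every row of bounds, ymin <= ecdf <= ymax, which is why the ribbon uses upper1 as ymin and lower1 as ymax.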
