Converting data to percentage rank - r

I have data whose mean and variance changes as a function of the independent variable. How do I convert the dependent variable into (estimated) conditional percentage ranks?
For example, say the data looks like Z below:
library(dplyr)
library(ggplot2)
data.frame(x = runif(1000, 0, 5)) %>%
mutate(y = sin(x) + rnorm(n())*cos(x)/3) ->
Z
We can plot it with Z %>% ggplot(aes(x,y)) + geom_point(): it looks like a dispersed sine function, where the variance around that sine function varies with x. My goal is to convert each y value into a number between 0 and 1 representing its percentage rank among values with similar x. So values very close to the sine function should be converted to about 0.5, while values below it should be converted to values closer to 0 (depending on the variance around that x).
One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket.
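For example, a rough sketch of that bucketing idea (the 20 buckets and the bucket/pct_rank/Z_bucketed names are arbitrary choices for illustration):
# bucketing sketch: cut x into bins, then take the within-bin percentage rank of y
Z %>%
  mutate(bucket = cut(x, breaks = 20)) %>%
  group_by(bucket) %>%
  mutate(pct_rank = percent_rank(y)) %>%
  ungroup() ->
  Z_bucketed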
Another way (which I think is preferable) to do what I ask is to perform a quantile regression for a number of different quantiles (tau):
library(quantreg)
library(splines)
model.fit <- rq(y ~ bs(x, df = 5), tau = (1:9)/10, data = Z)
which can be plotted as follows:
library(tidyr)
data.frame(x = seq(0, 5, len = 100)) %>%
  data.frame(., predict(model.fit, newdata = .), check.names = FALSE) %>%
  gather(Tau, y, -x) %>%
  ggplot(aes(x, y)) +
  geom_point(data = Z, size = 0.1) +
  geom_line(aes(color = Tau), size = 1)
Given model.fit I could now use the estimated quantiles for each x value to convert each y value into a percentage rank (with the help of approx(...)) but I suspect that package quantreg may do this more easily and better. Is there, in fact, some function in quantreg which automates this?
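For reference, the manual conversion I have in mind would look roughly like this sketch (it assumes the fitted quantiles at each x do not cross; the pct_rank name is just for illustration):
taus <- (1:9) / 10
Q <- predict(model.fit, newdata = Z)  # one column of fitted values per tau
Z$pct_rank <- sapply(seq_len(nrow(Z)), function(i)
  approx(x = Q[i, ], y = taus, xout = Z$y[i], yleft = 0, yright = 1)$y)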

Related

Stop plotting predictions beyond data limits LME ggpredict Effects

Using the 'iris' dataset (slightly modified as below), I plot the results of an LME.
PLEASE NOTE: I am only using the iris dataset as mock data for the purpose of plotting, so please do not critique the appropriateness of this test. I'm not interested in the statistics, rather the plotting.
Using the ggpredict function and plotting the results, the plot extends the predictions beyond the range of the data. Is there a systematic way to plot predictions only within the range of the data in each facet?
I can plot each facet separately, limit the axes per plot manually, and cowplot them back together, but if there is a way to say 'predict only to the max. and min. of the data for that group', that would be great.
Given that these are facets of a single model, perhaps not showing the predictions for different groups is in fact misleading, and I should rather create three different models if I only want predictions within those data subsets?
library(lme4)
library(ggeffects)
library(ggplot2)
data(iris)
glimpse(iris)
df = iris
glimpse(df)
df_ed = df %>% group_by(Species) %>% mutate(Sepal.Length = ifelse(Species == "setosa",Sepal.Length+10,Sepal.Length+0))
df_ed = df_ed %>% group_by(Species) %>% mutate(Sepal.Length = ifelse(Species == "versicolor",Sepal.Length-3,Sepal.Length+0))
glimpse(df_ed)
m_test =
lmer(Sepal.Width ~ Sepal.Length * Species +
(1|Petal.Width),
data = df_ed, REML = T)
summary(m_test)
test_plot = ggpredict(m_test, c("Sepal.Length", "Species"), type = "re") %>% plot(rawdata = T, dot.alpha = 0.6, facet = T, alpha = 0.3)
As per the OP's comment, I think this will provide a solution. In this example, I use data from the sleepstudy dataset that comes with the lme4 package. First, we have to postulate a mixed model, which I generically call fit.
Note that I do not perform any hypothesis test to formally select an appropriate random-effects structure. Of course, this is essential to adequately capture the correlations in the repeated measurements, but falls outside the scope of this post.
library(lme4)
library(splines)
# quantiles of Days
quantile(sleepstudy$Days, c(0.05, 0.95))
# 5% 95%
# 0 9
# mixed model
fit <- lmer(Reaction ~ ns(Days, df = 2, B = c(0, 9)) +
(Days | Subject), data = sleepstudy)
# new data.frame for prediction
ND <- with(sleepstudy, expand.grid(Days = seq(0L, 9L, len = 50)))
Then, we need a function that enables us to obtain predictions from fit for certain values of the covariates. The function effectPlot_lmer() takes the following arguments:
object: the fitted merMod object (the mixed model).
ND: the new data.frame, which specifies the values of the covariates for which we want to obtain predictions.
orig_data: the data.frame on which the mixed model was fitted.
# function to obtain predicted reaction times
effectPlot_lmer <- function (object, ND, orig_data) {
  # fixed-effects formula and coefficient estimates from the fitted model
  form <- formula(object, fixed.only = TRUE)
  namesVars <- all.vars(form)
  betas <- fixef(object)
  V <- vcov(object)
  # build the fixed-effects design matrix for the new data,
  # using the factor levels of the original data
  orig_data <- orig_data[complete.cases(orig_data[namesVars]), ]
  Terms <- delete.response(terms(form))
  mfX <- model.frame(Terms, data = orig_data)
  Terms_new <- attr(mfX, "terms")
  mfX_new <- model.frame(Terms_new, ND, xlev = .getXlevels(Terms, mfX))
  X <- model.matrix(Terms_new, mfX_new)
  # predictions and pointwise 95% confidence limits
  pred <- c(X %*% betas)
  ses <- sqrt(diag(X %*% V %*% t(X)))
  ND$pred <- pred
  ND$low <- pred - 1.96 * ses
  ND$upp <- pred + 1.96 * ses
  return(ND)
}
Finally, we can make an effect plot with ggplot.
# effect plot
library(ggplot2)
ggplot(effectPlot_lmer(fit, ND, orig_data = sleepstudy),
aes(x = Days, y = pred)) +
geom_line(size = 1.2, colour = 'blue4') +
geom_ribbon(aes(ymin = low, ymax = upp), colour = NA,
fill = adjustcolor('blue4', 0.2)) +
theme_bw() + ylab('Expected Reaction (ms)')
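Applied back to the original iris-based example, the same idea of clipping predictions to the observed range could be sketched as follows (this assumes ggpredict's usual output columns x, predicted, conf.low, conf.high and group, and it is only one way to do it):
library(dplyr)
library(ggplot2)
# keep only predictions inside each Species' observed Sepal.Length range
pred <- as.data.frame(ggpredict(m_test, c("Sepal.Length", "Species"), type = "re"))
ranges <- df_ed %>%
  group_by(Species) %>%
  summarise(min_x = min(Sepal.Length), max_x = max(Sepal.Length)) %>%
  mutate(Species = as.character(Species))
pred_clipped <- pred %>%
  mutate(Species = as.character(group)) %>%
  left_join(ranges, by = "Species") %>%
  filter(x >= min_x, x <= max_x)
ggplot(pred_clipped, aes(x, predicted)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.3) +
  geom_line() +
  facet_wrap(~ Species)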

how to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns

Is it possible to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns?
The reason is that I have a variable y which depends on x, for which I am sure there is a positive correlation, although my dataset contains some noise. This is why I do not want a wiggly line for x between 1 and 75 as in the graph below.
library(ggplot2)
library(dplyr)
x <- c(1:100)
y <- c(1,3,6,12,22,15,13,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,5,9,10,1,6,12,22,15,13,11,5,-1,-12,-23,6,12,22,11,5,1,3,6,12,22,11,5,-11,-22,-9,12,22,11,5,9,10,18,1,3,6,12,22,15,13,11,5,-5, -9, -12,6,12,22,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,9,10,18,28,37,50,90,120,150,200)
y <- y + x
df <- data.frame(x, y)
model <- lm(y ~ poly(x, 6, raw = TRUE), data = df)
predictions <- model %>% predict(df)
df <- cbind(df, predictions)
ggplot() +
geom_point(data = df, aes(x = x, y = y), size = 0.1) +
geom_line(data = df, aes(x = x, y = predictions), colour="blue", size=0.1)
I can alter the model to:
model <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
Which gives this graph:
In this case the model has no wiggliness for x between 0 and 90, although it lacks the flexibility to make the turn around x = 90.
I am not looking for a specific solution for this example dataset. I am looking for a way to make a polynomial regression flexible enough to make sharp (big) turns while at the same time reducing wiggliness for small turns. (Maybe this could be solved by limiting the fit to a maximum of n turns?)
I want to apply this in an automated way to several datasets. For this reason I do not want to specify different ranges of x for different models.
Thank you!
I have also tried using gam from mgcv, although this gives similar results:
mygam <- gam(y ~ s(x, k = 7), data = df)
mygam <- gam(y ~ s(x, k = 3), data = df)
This graph is based on pmax(p1, p2) where p1 and p2 are two polynomials:
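A rough sketch of that idea using the df from above (the degrees 1 and 6 are arbitrary choices here):
# fit a low-order polynomial for the gentle trend and a higher-order one that
# can follow the sharp turn, then take the pointwise maximum of the two curves
p1 <- predict(lm(y ~ poly(x, 1, raw = TRUE), data = df), df)
p2 <- predict(lm(y ~ poly(x, 6, raw = TRUE), data = df), df)
df$combined <- pmax(p1, p2)
ggplot(df, aes(x, y)) +
  geom_point(size = 0.1) +
  geom_line(aes(y = combined), colour = "blue")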

Find all local maxima of a geom_smooth curve in R ggplot?

I need to find all local maxima of a geom_smooth() curve in R. This has been asked in Stack Overflow before:
How can I get the peak and valleys of a geom_smooth line in ggplot2?
But the answer related to finding a single maximum. What if there are multiple local maxima we want to find?
Here's some sample data:
library(tidyverse)
set.seed(404)
df <- data.frame(x = seq(0,4*pi,length.out=1000),
y = sin(seq(0,4*pi,length.out=1000))+rnorm(100,0,1))
df %>% ggplot(aes(x=x,y=y)) +
geom_point() +
geom_smooth()
To find a single maximum, we use the function underlying geom_smooth() in order to get the y values of the curve. This is either gam() for 1000+ data points or loess() for fewer than 1000. In this case, it's gam() from library(mgcv). Finding our maximum is then a simple matter of subsetting with which.max(). We can plot the modeled y values over geom_smooth() to confirm they're the same, with our maximum represented by a vertical line:
library(mgcv)
df <- df %>%
mutate(smooth_y = predict(gam(y ~ s(x,bs="cs"),data=df)))
maximum <- df$x[which.max(df$smooth_y)]
df %>% ggplot() +
geom_point(aes(x=x,y=y)) +
geom_smooth(aes(x=x,y=y)) +
geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red") +
geom_vline(xintercept = maximum,color="green")
So far, so good. But, there is more than one maximum here. Maybe we're trying to find the periodicity of the sine wave, measured as the average distance between maxima. How do we make sure we find all maxima in the series?
I am posting my answer below, but I am wondering if there's a more elegant solution than the brute-force method I used.
You can find the points where the difference between subsequent points flips sign using run-length encoding. Note that this method is approximate and relies on x being ordered. You can refine the locations by predicting more closely spaced x-values.
library(tidyverse)
library(mgcv)
set.seed(404)
df <- data.frame(x = seq(0,4*pi,length.out=1000),
y = sin(seq(0,4*pi,length.out=1000))+rnorm(100,0,1))
df <- df %>%
mutate(smooth_y = predict(gam(y ~ s(x,bs="cs"),data=df)))
# Run length encode the sign of difference
rle <- rle(diff(as.vector(df$smooth_y)) > 0)
# Calculate startpoints of runs
starts <- cumsum(rle$lengths) - rle$lengths + 1
# Take the points where the rle is FALSE (so difference goes from positive to negative)
maxima_id <- starts[!rle$values]
# Also convenient, but not in the question:
# minima_id <- starts[rle$values]
maximum <- df$x[maxima_id]
df %>% ggplot() +
geom_point(aes(x=x,y=y)) +
geom_smooth(aes(x=x,y=y)) +
geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red") +
geom_vline(xintercept = maximum,color="green")
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Created on 2020-12-24 by the reprex package (v0.3.0)
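To illustrate the refinement mentioned above, you can re-predict the gam fit on a finer grid and re-run the same run-length trick; the mean spacing between successive maxima then gives the periodicity asked about in the question (a sketch; a spurious maximum at the left edge may still need to be dropped by hand):
fine <- data.frame(x = seq(min(df$x), max(df$x), length.out = 10000))
fine$smooth_y <- as.numeric(predict(gam(y ~ s(x, bs = "cs"), data = df), newdata = fine))
runs <- rle(diff(fine$smooth_y) > 0)
starts <- cumsum(runs$lengths) - runs$lengths + 1
fine_maxima <- fine$x[starts[!runs$values]]
mean(diff(fine_maxima))  # should be close to 2*pi if only the two main peaks are found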
I went with a brute force, Monte Carlo method to solve the problem. Using replicate(), we try out 100 random ranges of x and find the maximum y value within each range. We reject maxima that occur at either end of the range. Then we find all unique values of the output vector:
maxima <- replicate(100, {
  x_range <- sample(df$x, size = 2, replace = FALSE) %>% sort()
  max_loc <- df %>%
    filter(x >= x_range[1] & x <= x_range[2]) %>%
    filter(smooth_y == max(smooth_y)) %>%
    pull(x)
  if (max_loc == min(x_range) | max_loc == max(x_range)) NA else max_loc
})
unique_maxima <- unique(maxima[!is.na(maxima)])
df %>% ggplot() +
geom_point(aes(x=x,y=y)) +
geom_smooth(aes(x=x,y=y)) +
geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red") +
geom_vline(xintercept = unique_maxima,color="green")

Is it possible to recreate the functionality of bayesplot's "mcmc_areas" plot in ggplot in R

There is a package supported by Stan called bayesplot that can produce nice density area plots with the area under the density curves partitioned based on credible intervals of the posterior parameter samples drawn through MCMC. This results in a plot that looks like the following:
I am looking to make a similar style of plot in ggplot, given generic 1D vectors of sampled values, without the input having to be a Stan fit or similar. Does anyone know how to do this? The density part is clear via geom_density, but I am struggling with the fill partitioning.
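For a single vector, I imagine the partitioning could be done by computing the density manually and shading a subset of it, something like the sketch below (using an arbitrary 89% equal-tailed interval), but I am hoping for a cleaner, more general approach:
library(ggplot2)
# shade only the part of the density that falls inside an 89% equal-tailed interval
samples <- rnorm(1000, 1, 0.5)
dens <- density(samples)
dd <- data.frame(x = dens$x, y = dens$y)
bounds <- quantile(samples, c(0.055, 0.945))
ggplot(dd, aes(x, y)) +
  geom_area(data = subset(dd, x >= bounds[1] & x <= bounds[2]), fill = "skyblue") +
  geom_line()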
Here's a function that generates a plot similar to bayesplot::mcmc_areas. It plots credible intervals (equal-tailed by default, or highest density) with optional setting of the probability width of the interval:
library(tidyverse)
library(ggridges)
library(bayestestR)
theme_set(theme_classic(base_size=15))
# Create ridgeplots with credible intervals
# ARGUMENTS
# data A data frame
# FUN A function that calculates credible intervals
# ci The width of the credible interval
# ... For passing optional arguments to geom_ridgeline.
# For example, change the scale parameter to control overlap of ridge lines.
# geom_ridgeline's default is scale=1.
plot_density_ridge = function(data, FUN=c("eti", "hdi"), ci=0.89, ...) {
  # Determine whether to use eti or hdi function
  FUN = match.arg(FUN)
  FUN = match.fun(FUN)
  # Get kernel density estimate as a data frame
  dens = map_df(data, ~ {
    d = density(.x, na.rm=TRUE)
    tibble(x=d$x, y=d$y)
  }, .id="name")
  # Set relative width of median line
  e = diff(range(dens$x)) * 0.006
  # Get credible interval width and median
  cred.int = data %>%
    pivot_longer(cols=everything()) %>%
    group_by(name) %>%
    summarise(CI=list(FUN(value, ci=ci)),
              m=median(value, na.rm=TRUE)) %>%
    unnest_wider(CI)
  dens %>%
    left_join(cred.int) %>%
    ggplot(aes(y=name, x=x, height=y)) +
    geom_vline(xintercept=0, colour="grey70") +
    geom_ridgeline(data = . %>% group_by(name) %>%
                     filter(between(x, CI_low, CI_high)),
                   fill=hcl(230,25,85), ...) +
    geom_ridgeline(data = . %>% group_by(name) %>%
                     filter(between(x, m - e, m + e)),
                   fill=hcl(240,30,60), ...) +
    geom_ridgeline(fill=NA, ...) +
    geom_ridgeline(fill=NA, aes(height=0), ...) +
    labs(y=NULL, x=NULL)
}
Now let's try out the function
# Fake data
set.seed(2)
d = data.frame(a = rnorm(1000, 0.6, 1),
b = rnorm(1000, 1.3, 0.5),
c = rnorm(1000, -1.2, 0.7))
plot_density_ridge(d)
plot_density_ridge(d, ci=0.5, scale=1.5)
plot_density_ridge(iris %>% select(-Species))
plot_density_ridge(iris %>% select(-Species), FUN="hdi")
Use the ggridges package:
library(tidyverse)
library(ggridges)
tibble(data_1, data_2, data_3) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name, group = name)) +
geom_density_ridges()
Data:
set.seed(123)
n <- 15
data_1 <- rnorm(n)
data_2 <- data_1 - 1
data_3 <- data_1 + 2

aggregate/sum with ggplot

Is there a way to sum data with ggplot2?
I want to do a bubble map with the size depending on the sum of z.
Currently I'm doing something like
dd <- ddply(d, .(x,y), transform, z=sum(z))
qplot(x,y, data=dd, size=z)
But I feel I'm writing the same thing twice, and I would like to be able to write something like
qplot(x,y, data=dd, size=sum(z))
I had a look at stat_sum and stat_summary but I'm not sure they are appropriate either.
Is it possible to do this with ggplot2? If not, what would be the best way to write those two lines?
It can be done using stat_sum within ggplot2. By default, the dot size represents proportions. To get dot size to represent counts, use size = ..n.. as an aesthetic. Counts (and proportions) by a third variable can be obtained by weighting by the third variable (weight = cost) as an aesthetic. Some examples, but first, some data.
library(ggplot2)
set.seed(321)
# Generate some data
df <- expand.grid(x = seq(1:5), y = seq(1:5), KEEP.OUT.ATTRS = FALSE)
df$Count = sample(1:25, 25, replace = F)
library(plyr)
new <- dlply(df, .(Count), function(data) matrix(rep(matrix(c(data$x, data$y), ncol = 2), data$Count), byrow = TRUE, ncol = 2))
df2 <- data.frame(do.call(rbind, new))
df2$cost <- 1:325
The data contains units categorised according to two factors: X1 and X2; and a third variable which is the cost of each unit.
Plot 1: Plots the proportion of elements at each X1 - X2 combination. group=1 tells ggplot to calculate proportions out of the total number of units in the data frame.
ggplot(df2, aes(factor(X1), factor(X2))) +
stat_sum(aes(group = 1))
Plot 2: Plots the number of elements at each X1 - X2 combination.
ggplot(df2, aes(factor(X1), factor(X2))) +
stat_sum(aes(size = ..n..))
Plot 3: Plots the cost of the elements at each X1 - X2 combination, that is weight by the third variable.
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = 1, weight = cost, size = ..n..))
Plot 4: Plots the proportion of the total cost of all elements in the data frame at each X1 - X2 combination
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = 1, weight = cost))
Plot 5: Plots proportions, but instead of the proportion being out of the total cost across all elements in the data frame, the proportion is out of the cost for elements within each category of X1. That is, within each X1 category, where does the major cost for X2 units occur?
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = X1, weight = cost))
You could put the ddply call into the qplot:
d <- data.frame(x=1:10, y=1:10, z= runif(100))
qplot(x, y, data=ddply(d, .(x,y), transform, z=sum(z)), size=z)
Or use the data.table package.
library(data.table)
DT <- data.table(d, key = 'x,y')
qplot(x, y, data = DT[, sum(z), by = 'x,y'], size = V1)
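For comparison, a dplyr-based sketch of the same aggregation, piped straight into ggplot():
library(dplyr)
library(ggplot2)
# aggregate z per (x, y) inside the pipe, then map the sum to point size
d %>%
  group_by(x, y) %>%
  summarise(z = sum(z), .groups = "drop") %>%
  ggplot(aes(x, y, size = z)) +
  geom_point()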
