Understanding the fundamentals of quantile() and quantiles - R

If there are, say, 59 observations, I'm confused why quantile(data, probs = 0.05) would put 3 values under the 5th percentile, since 3/59 ≈ 0.051.
library(tidyverse)
a <- 1:59
b <- rnorm(59)
df <- data.frame(a, b)
df_5thperc <- df %>%
  summarize(`05%` = quantile(b, probs = 0.05))
y <- mean(df_5thperc$`05%`)
ggplot() +
  geom_point(data = df, aes(x = a, y = b)) +
  geom_hline(yintercept = y, color = "blue")

To expand on @BenBolker's comment, you could consider the type argument of the quantile() function. You are using a continuous distribution, so types 4 through 9 are relevant. For example:
b[b < quantile(b, probs = c(.05), type = 9)]
Types 4 and 6 will give what you were probably expecting:
[1] -1.893092 -3.263889
while types 5, 7, 8, and 9 will give
[1] -1.893092 -1.538927 -3.263889
The help file gives much detail about why, but in the end it comes down to the fact that there is no agreed-upon method for estimating sample quantiles (including the median).
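To see how much the choice of type matters here, you can tabulate the estimate across all the continuous types at once — a quick sketch reusing the b vector from the question:
# Compare the 5th-percentile estimate across quantile types 4 through 9
sapply(4:9, function(t) quantile(b, probs = 0.05, type = t))
# ...and count how many observations fall below each estimate
sapply(4:9, function(t) sum(b < quantile(b, probs = 0.05, type = t)))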

Related

Remove data to the left and right of local minima

I have a lot of measurements where I get data that looks something like this:
# Generate example data
x <- 1:100
y <- 100*(1-exp(-0.3*x))
x2 <- 101:200
y2 <- rev(y)
df <- data.frame("x" = c(x, x2),
                 "y" = c(y, y2))
df$x <- df$x + 50
rm(x, x2, y, y2)
x <- 1:50
y <- 25.91818
x2 <- 251:300
y2 <- 25.91818
df2 <- data.frame("x" = c(x, x2),
                  "y" = c(y, y2))
rm(x, x2, y, y2)
df <- rbind(df, df2)
rm(df2)
If I plot this I can see that there are left-most and right-most local minima.
library(ggplot2)
p <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_point(data = data.frame("x" = c(50, 250), "y" = c(25.91818, 25.91818)),
             mapping = aes(x, y), colour = "red") +
  scale_y_continuous(limits = c(0, 101))
p + annotate("text", label = "minimum 1", x = 50, y = 20) +
  annotate("text", label = "minimum 2", x = 250, y = 20)
What I would like to do is trim the data to the left of minimum 1 and to the right of minimum 2. It's not super straightforward, as there may also be local minima between those two points, because the real data doesn't look this ideal. I would also need to apply this process to many samples, but I think that part may be trivial because I could use e.g. dplyr and group_by().
I had some luck plotting the local minima using the ggpmisc package, but I'm not sure how I can use that to actually subset my data. Just for clarity I included the code to do so below, and with the real data it looks a little better:
library(ggpmisc)
p2 <- ggplot(df, aes(x, y)) +
  geom_line() +
  ggpmisc::stat_peaks(col = "red", span = 3)
p2
I hope this is clear and I'm happy to clarify any questions. Thank you in advance.
You could do this using the following steps:
1. Sort your data according to its x co-ordinates.
2. On the sorted data, take the diff of the y co-ordinates; this will be 0 (or close to 0) for the flat sections at either end, as well as for any flat sections in between.
3. Starting from the left, find the first point where the diff is not zero (or at least is above a minimal threshold). Store this index as a variable called left.
4. Starting from the right, find the first point where the diff is not zero (or at least is above a minimal threshold). Store this index as a variable called right.
5. Subset your data frame so it only contains the data between rows left:right.
So, in your example we would have:
# Define a minimal threshold, above which we are no longer on the flat minimum line
minimal_change <- 1e-6
df <- df[order(df$x), ]                                            # Step 1
left <- which(diff(df$y) > minimal_change)[1]                      # Steps 2-3
right <- nrow(df) - which(diff(rev(df$y)) > minimal_change)[1] + 1 # Step 4
df <- df[left:right, ]                                             # Step 5
Now we can plot the result:
ggplot(df, aes(x, y)) +
  geom_line() +
  geom_point(data = data.frame("x" = c(50, 250), "y" = c(25.91818, 25.91818)),
             mapping = aes(x, y), colour = "red") +
  scale_y_continuous(limits = c(0, 101)) +
  scale_x_continuous(limits = c(0, 300))
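To apply the same trim to many samples at once, as the question anticipates, one option is to wrap the steps above in a function and run it per group with dplyr — a sketch, where sample_id stands in for a hypothetical grouping column in the real data:
library(dplyr)
# Trim the flat sections at both ends of one sample's data
trim_flat_ends <- function(d, minimal_change = 1e-6) {
  d <- d[order(d$x), ]
  left <- which(diff(d$y) > minimal_change)[1]
  right <- nrow(d) - which(diff(rev(d$y)) > minimal_change)[1] + 1
  d[left:right, ]
}
# Hypothetical usage: df_all has columns x, y and a grouping column sample_id
# df_all %>%
#   group_by(sample_id) %>%
#   group_modify(~ trim_flat_ends(.x)) %>%
#   ungroup()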

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size = length(y[X < 3]), prob = c(0.9, 0.1),
                   replace = TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it uses the mean and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of facetting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (the median does):
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth() +
  scale_y_log10()
Finally, I know about geom_quantile, but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_quantile(quantiles = c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth, as long as you have a function that produces an object for which the predict generic works, i.e. one that takes a data frame with a column called x and translates it into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course, this doesn't look very "flat", but it is far flatter than the line calculated by the loess method of the standard geom_smooth():
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit.
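As a usage note, the rolling-median window can be widened without redefining anything, because geom_smooth forwards extra arguments to the method through method.args (n_roll is the parameter defined in rolling_median above):
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE,
              method.args = list(n_roll = 31))  # wider (odd) window, smoother line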
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and for series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  ys <- rep(NA, length(x))     # for the smoothed y values
  xs <- setdiff(unique(x), NA) # the unique x values
  for (i in seq_along(xs)) {
    j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
    ys[x == xs[i]] <- median(y[j], na.rm = TRUE)       # y median over this window
  }
  y <- ys
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
  geom_point(aes(x = x, y = y)) +
  geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x,
              method = "rolling_median", se = FALSE,
              method.args = list(n_roll = nwindow)) +
  geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x,
              method = "rolling_median2", se = FALSE,
              method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

Linear model of geom_histogram data

I'm working with a dataset in which I have a continuous variable x and categorical variables derived from continuous variables y and z. Something like this:
set.seed(222)
df = data.frame(x = c(0, c(1:99) + rnorm(99, mean = 0, sd = 0.5), 100),
                y = rep(50, times = 101) - seq(0, 50, by = 0.5) + rnorm(101, mean = 30, sd = 20),
                z = rnorm(101, mean = 50, sd = 10))
df$positive.y = sapply(df$y, function(x) {
  if (x >= 50) {"Yes"} else {"No"}
})
df$positive.z = sapply(df$z, function(x) {
  if (x >= 50) {"Yes"} else {"No"}
})
Then, using this dataset, I can create histograms to see whether there is a correlation between x and positive.y (or positive.z). With 10 bins it is clear that x correlates with positive.y, but not with positive.z:
ggplot(df, aes(x = x, fill = positive.y)) +
  geom_histogram(position = "fill", bins = 10)
ggplot(df, aes(x = x, fill = positive.z)) +
  geom_histogram(position = "fill", bins = 10)
Now from this I want two things:
1. Extract the actual data points to supply them to corr.test() or something like that.
2. Add geom_smooth(method = "lm") to the plot I have.
I tried to add a "bin" column to the df, like this:
df$bin = sapply(df$x, function(x) {
  if (x <= 10) {1}
  else if (x > 10 & x <= 20) {2}
  else if .......
})
Then, using tapply(), I count the number of "Yes" and "No" values for each df$bin and convert the counts to percentages.
But in this case, each time I change the number of bins in the histogram, I have to re-write and re-run this part of the code, which is tedious and consumes a lot of computer time if the dataset is large.
Is there a more straightforward way to achieve the same result?
I don't see a good justification for adding an lm line. Logistic regression is the appropriate model and doesn't require binning:
df$positive.y <- factor(df$positive.y)
mod <- glm(positive.y ~ x, data = df, family = "binomial")
summary(mod)
anova(mod)
library(ggplot2)
ggplot(df, aes(x = x, fill = positive.y)) +
  geom_histogram(position = "fill", bins = 10) +
  stat_function(fun = function(x) predict(mod, newdata = data.frame(x = x),
                                          type = "response"),
                size = 2)
If you need an R² value (why?), there are various pseudo-R² measures available for GLMs, e.g.:
library(fmsb)
NagelkerkeR2(mod)
#$N
#[1] 101
#
#$R2
#[1] 0.4074274
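If you do also want the binned proportions themselves (the asker's first point), cut() builds the bin column in one step, so changing the bin count means changing one number rather than rewriting an if/else ladder — a sketch (note that cut()'s equal-width breaks only approximately match geom_histogram's bin edges):
bins <- 10  # change this to match the histogram's bins argument
df$bin <- cut(df$x, breaks = bins)
# proportion of "Yes" within each bin of x
prop_yes <- tapply(df$positive.y == "Yes", df$bin, mean)
prop_yes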

Underscore plot in R

Introduction and Current Work Done
[Note: For those interested, I have provided code at the end for reproducing my example.]
I have some data and I have conducted an ANOVA analysis and obtained Tukey's pairwise comparisons:
model1 = aov(trt ~ grp, data = df)
anova(model1)
> TukeyHSD(model1)
diff lwr upr p adj
B-A 0.03481504 -0.40533118 0.4749613 0.9968007
C-A 0.36140489 -0.07874134 0.8015511 0.1448379
D-A 1.53825179 1.09810556 1.9783980 0.0000000
C-B 0.32658985 -0.11355638 0.7667361 0.2166301
D-B 1.50343674 1.06329052 1.9435830 0.0000000
D-C 1.17684690 0.73670067 1.6169931 0.0000000
I can also plot Tukey's pairwise comparisons
> plot(TukeyHSD(model1))
We can see from Tukey's confidence intervals and the plot that A-B, B-C and A-C are not significantly different.
Problem
I have been asked to create something called an "underscore plot" which is described as follows:
We plot the group means on the real line and we draw a line segment between group means to indicate that there is no significant difference between those two particular groups.
Obtaining the means is not difficult:
> aggregate(df$trt ~ df$grp, FUN = mean)
df$grp df$trt
1 A 2.032086
2 B 2.066901
3 C 2.393491
4 D 3.570338
Desired Output
Using the data in this example, the desired plot should appear like the one below:
There is a line segment between the groups that are not significantly different (i.e. a line segment between A-B, B-C and A-C as indicated by Tukey's).
Note: Please note that the plot above is not to scale; it was created in Keynote for illustrative purposes only.
Is there a way to get the "underscore plot" described above using R (using either base R or a library such as ggplot2)?
Edit
Here is the code that I used to create the example above:
library(data.table)
set.seed(3)
A = runif(20, 1,3)
A = data.frame(A, rep("A", length(A)))
B = runif(20, 1.25,3.25)
B = data.frame(B, rep("B", length(B)))
C = runif(20, 1.5,3.5)
C = data.frame(C, rep("C", length(C)))
D = runif(20, 2.75,4.25)
D = data.frame(D, rep("D", length(D)))
df = list(A, B, C, D)
df = rbindlist(df)
colnames(df) = c("trt", "grp")
Here's a ggplot version of the underscore plot. We'll load the tidyverse package, which loads ggplot2, dplyr, and several other packages. We create one data frame of coefficients for plotting the group names, coefficient values, and vertical segments, and another data frame of non-significant pairs for generating the horizontal underscores.
library(tidyverse)
model1 = aov(trt ~ grp, data=df)
# Get coefficients and label them with the names of the levels
coefs = coef(model1)
coefs[2:4] = coefs[2:4] + coefs[1]  # convert treatment contrasts to group means
names(coefs) = levels(model1$model$grp)
# Get non-significant pairs
pairs = TukeyHSD(model1)$grp %>%
  as.data.frame() %>%
  rownames_to_column(var = "pair") %>%
  # Keep only non-significant pairs
  filter(`p adj` > 0.05) %>%
  # Add coefficients to TukeyHSD results
  separate(pair, c("pair1", "pair2"), sep = "-", remove = FALSE) %>%
  mutate(start = coefs[match(pair1, names(coefs))],
         end = coefs[match(pair2, names(coefs))]) %>%
  # Stagger vertical positions of segments
  mutate(ypos = seq(-0.03, -0.04, length = 3))
# Turn coefs into a data frame
coefs = enframe(coefs, name = "grp", value = "coef")
ggplot(coefs, aes(x = coef)) +
  geom_hline(yintercept = 0) +
  geom_segment(aes(x = coef, xend = coef), y = 0.008, yend = -0.008, colour = "blue") +
  geom_text(aes(label = grp, y = 0.011), size = 4, vjust = 0) +
  geom_text(aes(label = sprintf("%1.2f", coef)), y = -0.01, size = 3,
            angle = -90, hjust = 0) +
  geom_segment(data = pairs, aes(group = pair, x = start, xend = end,
                                 y = ypos, yend = ypos),
               colour = "red", size = 1) +
  scale_y_continuous(limits = c(-0.05, 0.04)) +
  theme_void()
Base R
d1 = data.frame(TukeyHSD(model1)[[1]])
# Non-significant pairs are those whose interval endpoints differ in sign,
# i.e. the confidence interval contains 0
inds = which(sign(d1$lwr) * sign(d1$upr) <= 0)
non_sig = lapply(strsplit(row.names(d1)[inds], "-"), sort)
d2 = aggregate(df$trt ~ df$grp, FUN = mean)
graphics.off()
windows(width = 400, height = 200)  # Windows-only device; use dev.new() on other platforms
par("mai" = c(0.2, 0.2, 0.2, 0.2))
plot(d2$`df$trt`, rep(1, NROW(d2)),
     xlim = c(min(d2$`df$trt`) - 0.1, max(d2$`df$trt`) + 0.1), lwd = 2,
     type = "l",
     ann = FALSE, axes = FALSE)
segments(x0 = d2$`df$trt`,
         y0 = rep(0.9, NROW(d2)),
         x1 = d2$`df$trt`,
         y1 = rep(1.1, NROW(d2)),
         lwd = 2)
text(x = d2$`df$trt`, y = rep(0.8, NROW(d2)), labels = round(d2$`df$trt`, 2), srt = 90)
text(x = d2$`df$trt`, y = rep(0.75, NROW(d2)), labels = d2$`df$grp`)
lapply(seq_along(non_sig), function(i) {
  lines(cbind(d2$`df$trt`[match(non_sig[[i]], d2$`df$grp`)], rep(0.9 - 0.01 * i, 2)))
})

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
  geom_bar(aes(color = group))
I would like to determine the value (threshold? breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups" and "r threshold minimize misclassification," but my Google-fu seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So b0 = 1.1857435 and b1 = -0.0911762.
From Wikipedia, the logistic function is F(x) = 1 / (1 + e^-(b0 + b1*x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - b0) / b1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
# Classify as "B" when value > th, "A" otherwise, and take the proportion correct
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
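Since accuracy() is a step function that only changes when the threshold crosses an observed value, optimize() can in principle stall on a flat stretch; an exhaustive check over the midpoints between adjacent sorted values is a cheap way to verify the result — a sketch:
sv <- sort(value)
mids <- head(sv, -1) + diff(sv) / 2  # candidate thresholds between adjacent points
best <- mids[which.max(accuracy(mids))]
best           # threshold with the highest accuracy
accuracy(best) # should be at least the optimize() objective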
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial.
The dose.p() function in MASS will do the work for me, given a glm fit.
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
  geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
  prob <- 1 / (1 + exp(-1 * (b0 + b1 * x)))
  return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
  x <- (log(p / (1 - p)) - b0) / b1
  return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!
