Linear model of geom_histogram data - r

I'm working with a dataset that has a continuous variable x and categorical variables derived from y and z. Something like this:
set.seed(222)
df = data.frame(x = c(0, c(1:99) + rnorm(99, mean = 0, sd = 0.5), 100),
y = rep(50, times = 101)-(seq(0, 50, by = 0.5))+rnorm(101, mean = 30, sd = 20),
z = rnorm(101, mean = 50, sd= 10))
df$positive.y = sapply(df$y,
function(x){
if (x >= 50){"Yes"} else {"No"}
})
df$positive.z = sapply(df$z,
function(x){
if (x >= 50){"Yes"} else {"No"}
})
Then, using this dataset, I can create histograms to see whether there is a correlation between x and positive.y (or positive.z). With 10 bins it is clear that x correlates with positive.y, but not with positive.z:
ggplot(df,
aes(x = x, fill = positive.y))+
geom_histogram(position = "fill", bins = 10)
ggplot(df,
aes(x = x, fill = positive.z))+
geom_histogram(position = "fill", bins = 10)
Now from this I want two things:
Extract the actual data points to supply them to corr.test() function or something like that.
Add geom_smooth(method = "lm") to the plot I have.
I tried to add a "bin" column to the df, like this:
df$bin = sapply(df$x,
function(x){
if (x <= 10){1}
else if (x > 10 & x <= 20) {20}
else if .......
})
Then, using tapply(), I count the number of "Yes" and "No" values for each df$bin and convert the counts to percentages.
But in this case, each time I change the number of bins in the histogram, I have to rewrite and re-run this part of the code, which is tedious and takes a lot of computing time if the dataset is large.
Is there a more straightforward way to achieve the same result?

I don't see a good justification for adding an lm line. Logistic regression is the appropriate model and doesn't require binning:
df$positive.y <- factor(df$positive.y)
mod <- glm(positive.y ~ x, data = df, family = "binomial")
summary(mod)
anova(mod)
library(ggplot2)
ggplot(df,
aes(x = x, fill = positive.y))+
geom_histogram(position = "fill", bins = 10) +
stat_function(fun = function(x) predict(mod, newdata = data.frame(x = x),
type = "response"),
size = 2)
If you need an R² value (why?), there are different pseudo-R² available for GLMs, e.g.,
library(fmsb)
NagelkerkeR2(mod)
#$N
#[1] 101
#
#$R2
#[1] 0.4074274
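If you still want the binned proportions themselves (for instance to inspect them or pass them to another function), they don't need a hand-written if/else chain. A minimal sketch, assuming equal-width bins from cut(); bin_props and n_bins are names made up for this example, the data only loosely mirror the question's, and the bin edges need not match the ones geom_histogram chooses:

```r
# Data in the spirit of the question (only x and positive.y are needed here)
set.seed(222)
df <- data.frame(x = c(0, 1:99 + rnorm(99, mean = 0, sd = 0.5), 100))
df$y <- 50 - seq(0, 50, by = 0.5) + rnorm(101, mean = 30, sd = 20)
df$positive.y <- ifelse(df$y >= 50, "Yes", "No")

# Hypothetical helper: proportion of "Yes" per equal-width bin of x
bin_props <- function(x, flag, n_bins = 10) {
  bin <- cut(x, breaks = n_bins, include.lowest = TRUE)
  data.frame(
    bin      = levels(bin),
    n        = as.vector(table(bin)),                    # observations per bin
    prop_yes = as.vector(tapply(flag == "Yes", bin, mean))  # share of "Yes"
  )
}

props10 <- bin_props(df$x, df$positive.y, n_bins = 10)
props20 <- bin_props(df$x, df$positive.y, n_bins = 20)  # change bins with one argument
props10
```

Changing the number of bins is then a single argument rather than a rewrite, although the logistic regression above remains the better basis for inference.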

Related

ggplot2: Plot two different Densities in the same Plot of the same Variable before and after a Cutoff

My goal is to plot two different densities of the same variable in the same plot. I want to do this because it is common to show the robustness of the forcing variable (here z) in a regression discontinuity design. The code below works, but I do not want the density to be plotted below the cutoff (here 0) if the key is "above", and vice versa. Also, the curve should not merely be hidden on the other side after smoothing: the density should be computed only up to (or starting from) the cutoff.
library(ggplot2)
library(dplyr) # provides tibble()
x <- rnorm(1000, mean = 0)
y <- rnorm(500, mean = 2)
z <- append(x,y)
d <- tibble(value = z, key = ifelse(z <= 0, "below", "above"))
ggplot(d) +
geom_density(aes(z, group = key)) +
geom_vline(aes(xintercept = 0))
Does anybody know how to implement this? For linear regressions I got it working, but geom_density() plots the other side of the cutoff as well and smooths across it.
Thanks in advance for your help.
You can use trim = TRUE in geom_density to only calculate density over the range of values in the data:
library(ggplot2)
library(dplyr)
x <- rnorm(1000, mean = 0)
y <- rnorm(500, mean = 2)
z <- append(x,y)
d <- tibble(value = z, key = ifelse(z <= 0, "below", "above"))
ggplot(d) +
# added fill for easier discrimination
geom_density(aes(value, group = key, fill = key),
alpha = 0.5, trim = TRUE) +
geom_vline(aes(xintercept = 0), lty = 2, colour = 'red')
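To see what trimming amounts to outside of ggplot2, here is a rough base R sketch. It assumes that trim = TRUE is equivalent to evaluating each group's kernel density estimate only between that group's minimum and maximum; the variable names are made up for this example:

```r
set.seed(1)
z <- c(rnorm(1000, mean = 0), rnorm(500, mean = 2))

# Split at the cutoff, as in the question
below <- z[z <= 0]
above <- z[z > 0]

# Evaluate each density only over the range of its own group
d_below <- density(below, from = min(below), to = max(below))
d_above <- density(above, from = min(above), to = max(above))

range(d_below$x)  # ends at the largest value <= 0
range(d_above$x)  # starts at the smallest value > 0
```

Neither curve crosses the cutoff, because the evaluation grid stops at the most extreme observation on each side.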

How to put plotmath labels in ggplot facets

We often want individual regression equations in ggplot facets. The best way to do this is build the labels in a dataframe and then add them manually. But what if the labels contain plotmath, e.g., superscripts?
Here is a way to do it. The plotmath is converted to a string and then parsed by ggplot. The test_eqn function is taken from another Stack Overflow post; I'll link it when I find it again. Sorry about that.
library(ggplot2)
library(dplyr)
test_eqn <- function(y, x){
m <- lm(log(y) ~ log(x)) # fit y = a * x ^ b in log space
p <- exp(predict(m)) # model prediction of y
eq <- substitute(expression(Y==a~X^~b),
list(
a = format(unname(exp(coef(m)[1])), digits = 3),
b = format(unname(coef(m)[2]), digits = 3)
))
list(eq = as.character(eq)[2], pred = p)
}
set.seed(123)
x <- runif(20)
y <- runif(20)
test_eqn(x,y)$eq
#> [1] "Y == \"0.57\" ~ X^~\"0.413\""
data <- data.frame(x = x,
y = y,
f = sample(c("A","B"), 20, replace = TRUE)) %>%
group_by(f) %>%
mutate(
label = test_eqn(y,x)$eq, # add label
labelx = mean(x),
labely = mean(y),
pred = test_eqn(y,x)$pred # add prediction
)
# plot fits (use slice(1) to avoid multiple copies of labels)
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_line(aes(x = x, y = pred), colour = "red") +
geom_text(data = slice(data, 1), aes(x = labelx, y = labely, label = label), parse = TRUE) +
facet_wrap("f")
Created on 2021-10-20 by the reprex package (v2.0.1)

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size=length(y[X < 3]),prob=c(0.9, 0.1),
replace=TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it does mean, and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of faceting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (the median does):
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth() +
scale_y_log10()
Finally, I know about geom_quantile, but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data=NULL, aes(x=X, y=y)) +
geom_quantile(quantiles=c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth. All you need is a function that produces an object on which the predict generic works, i.e. one whose predict method takes a data frame with a column called x and translates it into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course, this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth():
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit.
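To illustrate why a running median shrugs off isolated extreme values while a running mean does not, here is a small dependency-free sketch. It uses stats::runmed as a stand-in for zoo::rollmedian and a toy vector rather than the question's data; the window width of 11 just mirrors the n_roll default above:

```r
# Toy series: flat at 0 with two isolated extreme values
y_toy <- rep(0, 50)
y_toy[c(10, 30)] <- 1000

k <- 11  # window width (must be odd for runmed)

# Running median: a single outlier per window never reaches the middle rank
med <- as.numeric(stats::runmed(y_toy, k = k))

# Running mean over the same window, for comparison (NA at the edges)
avg <- as.numeric(stats::filter(y_toy, rep(1 / k, k), sides = 2))

range(med)               # stays at 0 everywhere
max(avg, na.rm = TRUE)   # pulled up to 1000 / 11 near each outlier
```

This is the same contrast the two geom_smooth layers above show, just without the plotting machinery.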
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and series with multiple observations at the same time (x value). It uses a loop so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
ys <- rep(NA, length(x)) # for the smoothed y values
xs <- setdiff(unique(x), NA) # the unique x values
i <- 1 # for testing
for (i in seq_along(xs)){
j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
ys[x == xs[i]] <- median(y[j], na.rm = TRUE) # y median over this window
}
y <- ys
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x, method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x, method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

stat_summary: Including single observations into aggregating function

I would like to "force" an aggregating function in stat_summary to calculate an output value for single observations:
set.seed(1)
value <- c(rep(1:6, each = 3), 7:8)
rel_freq <- sample(x = seq(0, 1, 0.1), size = length(value), replace = TRUE)
example_df <- data.frame(value, rel_freq)
require(ggplot2)
ggplot() +
stat_summary(data = example_df,
mapping = aes(x = as.character(value), y = rel_freq),
fun.data = mean_se)
# Warning message: Removed 2 rows containing missing values (geom_pointrange)
I think what happened here is that ggplot removed the groups for values 7 and 8 because the aggregating function in stat_summary doesn't work with single observations. Is there a way to force an output here?
You could write your own little function that extends mean_se to handle the case where the length of x equals 1.
mean_se_tjebo <- function (x, mult = 1) {
x <- stats::na.omit(x)
se <- mult * sqrt(stats::var(x)/length(x))
mean <- mean(x)
if(length(x) != 1) {
data.frame(y = mean, ymin = mean - se, ymax = mean + se)
} else {
data.frame(y = mean, ymin = mean, ymax = mean)
}
}
Now the plot looks as follows
ggplot() +
stat_summary(data = example_df,
mapping = aes(x = as.character(value), y = rel_freq),
fun.data = mean_se_tjebo)
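The underlying reason for the warning appears to be that the sample variance of a single value is NA, so ymin and ymax become NA and the row is dropped. A compact variant of the same idea, as a sketch only (mean_se_single is a made-up name, and treating the standard error of a single observation as 0 is an assumption made for plotting purposes, not a statistical statement):

```r
var(5)  # NA: the sample variance is undefined for one observation

# Hypothetical variant: fall back to a zero-width interval for n = 1
mean_se_single <- function(x, mult = 1) {
  x <- stats::na.omit(x)
  se <- if (length(x) > 1) mult * sqrt(stats::var(x) / length(x)) else 0
  m <- mean(x)
  data.frame(y = m, ymin = m - se, ymax = m + se)
}

single <- mean_se_single(7)           # one observation: point only
triple <- mean_se_single(c(1, 2, 3))  # usual mean +/- standard error
single
triple
```

It can be passed to stat_summary via fun.data in the same way as the function above.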

Graph GLM in ggplot2 where x variable is categorical

I need to graph the predicted probabilities of a logit regression in ggplot2. Essentially, I am trying to graph a glm by each treatment condition within the same graph. However, I am quite confused about how to do this, since my treat variable (i.e. the x I am interested in) is categorical. This means that when I try to graph the treatment effects using ggplot, I just get a bunch of points at 0, 1, and 2 but no lines.
My question is... How could I graph the logit prediction lines in this case? Thanks in advance!
set.seed(96)
df <- data.frame(
vote = sample(0:1, 200, replace = T),
treat = sample(0:3, 200, replace = T))
glm_output <- glm(vote ~ as.factor(treat), data = df, family = binomial(link = "logit"))
predicted_vote <- predict(glm_output, newdata = df, type = "link", interval = "confidence", se = TRUE)
df <- cbind(df, data.frame(predicted_vote))
Since the explanatory variable treat is categorical, it makes more sense to use a boxplot instead, like the following (this assumes df has a column predicted_prob holding the predicted probabilities):
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2)
If you want to see the predicted probabilities from the glm across values of other explanatory variables (gender and age below are placeholders that are not in the example data), you may try this:
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2) + facet_wrap(~gender)
# create age groups
df$age_group <- cut(df$age, breaks=seq(0,100,20))
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2) + facet_grid(age_group~gender)
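The plots above rely on a predicted_prob column that the question's code does not create. One possible way to obtain predicted probabilities with approximate 95% intervals per treatment level is sketched below; it predicts on the link scale and back-transforms with plogis(). The names newdata, lower and upper, the 1.96 multiplier (a Wald-type interval) and the choice of geom_pointrange are all assumptions of this example:

```r
set.seed(96)
df <- data.frame(
  vote  = sample(0:1, 200, replace = TRUE),
  treat = factor(sample(0:3, 200, replace = TRUE))
)

glm_output <- glm(vote ~ treat, data = df, family = binomial(link = "logit"))

# One row per treatment level
newdata <- data.frame(treat = factor(levels(df$treat), levels = levels(df$treat)))

# Predict on the link (log-odds) scale, then back-transform to probabilities
pred <- predict(glm_output, newdata = newdata, type = "link", se.fit = TRUE)
newdata$predicted_prob <- as.numeric(plogis(pred$fit))
newdata$lower <- as.numeric(plogis(pred$fit - 1.96 * pred$se.fit))
newdata$upper <- as.numeric(plogis(pred$fit + 1.96 * pred$se.fit))
newdata

# Optional plot: one point with an interval per treatment level
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(newdata, ggplot2::aes(x = treat, y = predicted_prob)) +
    ggplot2::geom_pointrange(ggplot2::aes(ymin = lower, ymax = upper))
  print(p)
}
```

With a single categorical predictor, the predicted probabilities are simply the observed vote shares in each treatment group, which is why there is no curve to draw between levels.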
