geom_smooth() with median instead of mean - r

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size = length(y[X < 3]), prob = c(0.9, 0.1),
                   replace = TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it uses the mean and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of faceting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (median does have a nice interpretation):
ggplot(data = NULL, aes(x = X, y = y)) + geom_smooth() +
  scale_y_log10()
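For concreteness, the binned-median workaround mentioned above might look roughly like this (just a sketch, assuming dplyr; the number of bins is arbitrary):
library(dplyr)
# bin X, then take the median of y within each bin
binned <- data.frame(X = X, y = y) %>%
  mutate(bin = cut(X, breaks = 20)) %>%
  group_by(bin) %>%
  summarise(X = mean(X), med_y = median(y))
ggplot(binned, aes(x = X, y = med_y)) + geom_point() + geom_line()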
Finally, I know about geom_quantile, but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_quantile(quantiles = c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use the median. I want geom_smooth so that the two plots are directly comparable: putting the mean and median smooths side by side to show "hey look, this super strong pattern between Y and X is driven by a few large outliers; if we look only at the median, the pattern disappears".
Thanks!!

You can create your own method to use in geom_smooth(), as long as you have a model-fitting function that returns an object for which the predict generic works: given a data frame with a column called x, it should return the corresponding predicted values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}

predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, of course, this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth():
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit.
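For instance, a minimal sketch of such a predict method might look like the following. This is not part of the original approach above: the rolling standard error is purely illustrative rather than a principled estimator, and the exact return shape that geom_smooth(se = TRUE) expects from predict can vary between ggplot2 versions.
rolling_median_se <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  fit <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  # naive rolling standard error, for illustration only
  se <- zoo::rollapply(y, n_roll, function(w) sd(w) / sqrt(length(w)), fill = NA)
  structure(list(f = approxfun(x, fit), g = approxfun(x, se)), class = "rollmed_se")
}

predict.rollmed_se <- function(mod, newdata, se.fit = FALSE, level = 0.95, ...) {
  fit <- mod$f(newdata$x)
  if (!se.fit) return(setNames(fit, newdata$x))
  se <- mod$g(newdata$x)
  z <- qnorm(1 - (1 - level) / 2)
  list(fit = cbind(fit = fit, lwr = fit - z * se, upr = fit + z * se),
       se.fit = se)
}
Calling geom_smooth(formula = y ~ x, method = "rolling_median_se", se = TRUE) should then draw a ribbon, but treat it as a sketch rather than a statistically meaningful interval.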

Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and for series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)

# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))

# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1

# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}

# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  ys <- rep(NA, length(x))     # for the smoothed y values
  xs <- setdiff(unique(x), NA) # the unique x values
  i <- 1 # for testing
  for (i in seq_along(xs)) {
    j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
    ys[x == xs[i]] <- median(y[j], na.rm = TRUE)       # y median over this window
  }
  y <- ys
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}

# plot smooths
ggplot(data) +
  geom_point(aes(x = x, y = y)) +
  geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x,
              method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
  geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x,
              method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

Related

How to make a contour ternary plot representing a response surface with R

I want to make a triangle plot that represents the response surface for all possible combinations of the X, Y and Z factors, with the gradient area inside the triangle expressing the predicted distribution of the response variable Gi.
# Here are the data:
X <- rep(c(45, 40, 55, 40, 43, 50, 43, 50, 43, 48), each = 3)
Y <- rep(c(15, 12, 22, 14, 14, 19, 12, 17, 17, 12), each = 3)
Z <- rep(c(15, 22, 12, 12, 19, 14, 14, 17, 12, 17), each = 3)
Gi <- c(353, 381, 320, 312, 335, 265, 394, 350, 374, 320, 299, 316, 300, 304, 295,
        360, 331, 395, 351, 280, 342, 299, 303, 279, 374, 364, 419, 306, 290, 315)
Ft <- data.frame(X, Y, Z)

# Fitted model:
require(compositions) # package "compositions"
model <- lm(Gi ~ ilr(Ft) + I(ilr(Ft)^2) + I(ilr(Ft)^3))

# Generate random compositional data for the factors X, Y, and Z
library(tmvtnorm)
library(dplyr) # for %>% and rename() below
corMat <- var(Ft)
dt3 <- rtmvnorm(n = 500, mean = c(45.2, 15.4, 15.4), sigma = corMat,
                lower = c(10, 5, 5), upper = c(80, 60, 60))

# Predict Gi using the model
pGi <- predict(model, list(Ft = dt3))
pdt <- cbind(dt3, pGi) %>%
  as.data.frame() %>%
  rename(X = V1, Y = V2, Z = V3)
With the model and the predicted data, is it possible to express the estimated pGi as a gradient surface in the triangle, to get an output like the example enclosed? I have tried ggtern below, but the output ternary plot is not what I want.
ggtern(data = pdt, aes(x = X, y = Y, z = Z, value = pGi)) +
  stat_interpolate_tern(geom = "polygon",
                        formula = value ~ x + y,
                        method = lm,
                        aes(fill = ..level..), expand = 1) +
  scale_fill_gradient(low = "green", high = "blue") +
  theme_gray() +
  theme(tern.axis.arrow.show = T)

Make ggplot with regression line and normal distribution overlay

I am trying to make a plot to show the intuition behind logistic (or probit) regression. How would I make a plot that looks something like this in ggplot?
(Wolf & Best, The Sage Handbook of Regression Analysis and Causal Inference, 2015, p. 155)
Actually, what I would prefer even more is to have one single normal distribution displayed along the y axis, with mean = 0 and a specific variance, so that I can draw horizontal lines going from the linear predictor to the y axis and the sideways normal distribution. Something like this:
What this is supposed to show (assuming I haven't misunderstood something) is . I haven't had much success so far...
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
# Probability density function of the standard logistic distribution
pdfDeltaFun <- function(x) {
  exp(x) / (1 + exp(x))^2
}

# Tried switching the x and y to be able to turn the
# distribution overlay 90 degrees with coord_flip()
ggplot(df, aes(x = y, y = x)) +
  geom_point() +
  geom_line() +
  stat_function(fun = pdfDeltaFun) +
  coord_flip()
I think this comes pretty close to the first illustration you give. If this is a thing you don't need to repeat many times, it is probably best to compute the density curves prior to plotting and use a separate data frame to plot these.
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
# For every row in `df`, compute a rotated normal density centered at `y` and shifted by `x`
curves <- lapply(seq_len(NROW(df)), function(i) {
  mu <- df$y[i]
  range <- mu + c(-3, 3)
  seq <- seq(range[1], range[2], length.out = 100)
  data.frame(
    x = -1 * dnorm(seq, mean = mu) + df$x[i],
    y = seq,
    grp = i
  )
})
# Combine above densities in one data.frame
curves <- do.call(rbind, curves)

ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line() +
  # The path draws the curve
  geom_path(data = curves, aes(group = grp)) +
  # The polygon does the shading. We can use `oob_squish()` to set a range.
  geom_polygon(data = curves, aes(y = scales::oob_squish(y, c(0, Inf)), group = grp))
The second illustration is pretty close to your code. I replaced your density function with the standard normal density function and added some extra parameters to stat_function():
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line() +
  stat_function(fun = dnorm,
                aes(x = after_stat(-y * 4 - 5), y = after_stat(x)),
                xlim = range(df$y)) +
  # We fill with a polygon, squishing the y-range
  stat_function(fun = dnorm, geom = "polygon",
                aes(x = after_stat(-y * 4 - 5),
                    y = after_stat(scales::oob_squish(x, c(-Inf, -1)))),
                xlim = range(df$y))

Linear model of geom_histogram data

I'm working with a dataset in which I have a continuous variable x and categorical variables y and z. Something like this:
set.seed(222)
df = data.frame(x = c(0, c(1:99) + rnorm(99, mean = 0, sd = 0.5), 100),
                y = rep(50, times = 101) - seq(0, 50, by = 0.5) + rnorm(101, mean = 30, sd = 20),
                z = rnorm(101, mean = 50, sd = 10))
df$positive.y = sapply(df$y, function(x) {
  if (x >= 50) {"Yes"} else {"No"}
})
df$positive.z = sapply(df$z, function(x) {
  if (x >= 50) {"Yes"} else {"No"}
})
Then, using this dataset, I can create histograms to see whether there is a correlation between x and positive.y (or positive.z). With 10 bins, it is clear that x correlates with positive.y, but not with positive.z:
ggplot(df, aes(x = x, fill = positive.y)) +
  geom_histogram(position = "fill", bins = 10)

ggplot(df, aes(x = x, fill = positive.z)) +
  geom_histogram(position = "fill", bins = 10)
Now from this I want two things:
1. Extract the actual data points, to supply them to the corr.test() function or something like that.
2. Add geom_smooth(method = "lm") to the plot I have.
I tried to add a "bin" column to the df, like this:
df$bin = sapply(df$x, function(x) {
  if (x <= 10) {1}
  else if (x > 10 & x <= 20) {20}
  else if .......
})
Then, using tapply(), I count the number of "Yes" and "No" for each df$bin and convert it to a percentage. But in this case, each time I change the number of bins in the histogram, I have to re-write and re-run this part of the code, which is tedious and consumes a lot of computer time if the dataset is large.
Is there a more straightforward way to achieve the same result?
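For reference, a rough sketch of that binning using cut() would look like this (the bin count becomes a single parameter, though re-running is of course still required; names are illustrative):
n_bins <- 10
df$bin <- cut(df$x, breaks = n_bins)
prop_yes <- tapply(df$positive.y == "Yes", df$bin, mean)
prop_yes # proportion of "Yes" per bin, comparable to the filled histogram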
I don't see a good justification for adding an lm line. Logistic regression is the appropriate model and doesn't require binning:
df$positive.y <- factor(df$positive.y)
mod <- glm(positive.y ~ x, data = df, family = "binomial")
summary(mod)
anova(mod)
library(ggplot2)
ggplot(df, aes(x = x, fill = positive.y)) +
  geom_histogram(position = "fill", bins = 10) +
  stat_function(fun = function(x) predict(mod, newdata = data.frame(x = x),
                                          type = "response"),
                size = 2)
If you need an R² value (why?), there are different pseudo-R² available for GLMs, e.g.,
library(fmsb)
NagelkerkeR2(mod)
#$N
#[1] 101
#
#$R2
#[1] 0.4074274

R - add centroids to scatter plot

I have a dataset with two continuous variables and one factor variable (two classes). I want to create a scatterplot with two centroids (one for each class) that includes error bars, in R. The centroids should be positioned at the mean values of x and y for each class.
I can easily create the scatter plot using ggplot2, but I can't figure out how to add the centroids. Is it possible to do this using ggplot / qplot?
Here is some example code:
x <- c(1,2,3,4,5,2,3,5)
y <- c(10,11,14,5,7,9,8,5)
class <- c(1,1,1,0,0,1,0,0)
df <- data.frame(class, x, y)
qplot(x,y, data=df, color=as.factor(class))
Is this what you had in mind?
centroids <- aggregate(cbind(x, y) ~ class, df, mean)
ggplot(df, aes(x, y, color = factor(class))) +
  geom_point(size = 3) +
  geom_point(data = centroids, size = 5)
This creates a separate data frame, centroids, with columns x, y, and class, where x and y are the mean values by class. Then we add a second point layer using centroids as the dataset.
This is a slightly more interesting version, useful in cluster analysis.
gg <- merge(df, aggregate(cbind(mean.x = x, mean.y = y) ~ class, df, mean), by = "class")
ggplot(gg, aes(x, y, color = factor(class))) +
  geom_point(size = 3) +
  geom_point(aes(x = mean.x, y = mean.y), size = 5) +
  geom_segment(aes(x = mean.x, y = mean.y, xend = x, yend = y))
EDIT: Response to OP's comment.
Vertical and horizontal error bars can be added using geom_errorbar(...) and geom_errorbarh(...).
centroids <- aggregate(cbind(x, y) ~ class, df, mean)
f <- function(z) sd(z) / sqrt(length(z)) # function to calculate std.err
se <- aggregate(cbind(se.x = x, se.y = y) ~ class, df, f)
centroids <- merge(centroids, se, by = "class") # add std.err columns to centroids
ggplot(gg, aes(x, y, color = factor(class))) +
  geom_point(size = 3) +
  geom_point(data = centroids, size = 5) +
  geom_errorbar(data = centroids, aes(ymin = y - se.y, ymax = y + se.y), width = 0.1) +
  geom_errorbarh(data = centroids, aes(xmin = x - se.x, xmax = x + se.x), height = 0.1)
If you want to calculate, say, a 95% confidence interval instead of the standard error, replace
f <- function(z)sd(z)/sqrt(length(z)) # function to calculate std.err
with
f <- function(z) qt(0.025, df = length(z) - 1, lower.tail = FALSE) * sd(z) / sqrt(length(z))
I could not get the exact code by @jlhoward to work for me (specifically with the error bars), so I made minor changes to remove errors and even remove warnings. So, you should be able to run the code from start to finish, and if @jlhoward wants to incorporate this into the existing answer, that's great.
centroids <- aggregate(cbind(mean.x = x, mean.y = y) ~ class, df, mean)
gg <- merge(df, centroids, by = "class")
f <- function(z) sd(z) / sqrt(length(z)) # function to calculate std.err
se <- aggregate(cbind(se.x = x, se.y = y) ~ class, df, f)
centroids <- merge(centroids, se, by = "class") # add std.err columns to centroids
ggplot(gg, aes(x = x, y = y, color = factor(class))) +
  geom_point(size = 3) +
  geom_point(data = centroids, aes(x = mean.x, y = mean.y), size = 5) +
  geom_errorbar(data = centroids,
                aes(x = mean.x, y = mean.y, ymin = mean.y - se.y, ymax = mean.y + se.y),
                width = 0.1) +
  geom_errorbarh(data = centroids, inherit.aes = FALSE, # keeps ggplot from using the first aes
                 aes(xmin = (mean.x - se.x), xmax = (mean.x + se.x), y = mean.y,
                     height = 0.1, color = factor(class))) +
  labs(x = "Label for x-axis", y = "Label for y-axis") +
  theme(legend.title = element_blank()) # remove legend title

Interpolating a path/curve within R

Within R, I want to interpolate an arbitrary path with constant distance
between interpolated points.
The test data look like this:
require("rgdal", quietly = TRUE)
require("ggplot2", quietly = TRUE)
r <- readOGR(".", "line", verbose = FALSE)
coords <- as.data.frame(r@lines[[1]]@Lines[[1]]@coords)
names(coords) <- c("x", "y")
print(coords)
x y
-0.44409 0.551159
-1.06217 0.563326
-1.09867 0.310255
-1.09623 -0.273754
-0.67283 -0.392990
-0.03772 -0.273754
0.63633 -0.015817
0.86506 0.473291
1.31037 0.998899
1.43934 0.933198
1.46854 0.461124
1.39311 0.006083
1.40284 -0.278621
1.54397 -0.271321
p.orig <- ggplot(coords, aes(x = x, y = y)) + geom_path(colour = "red") +
geom_point(colour = "yellow")
print(p.orig)
I tried different methods; none of them were really satisfying:
aspline (akima-package)
approx
bezierCurve
with the tourr-package I couldn't get started
aspline
aspline from the akima-package does some weird stuff when dealing with arbitrary paths:
plotInt <- function(coords) print(p.orig +
  geom_path(aes(x = x, y = y), data = coords) +
  geom_point(aes(x = x, y = y), data = coords))
N <- 50 # 50 points to interpolate
require("akima", quietly = TRUE)
xy.int.ak <- as.data.frame(with(coords, aspline(x = x, y = y, n = N)))
plotInt(xy.int.ak)
approx
xy.int.ax <- as.data.frame(with(coords, list(x = approx(x, n = N)$y,
y = approx(y, n = N)$y)))
plotInt(xy.int.ax)
At first sight, approx looks pretty fine; however, testing it with real data gives me problems with the distances between the interpolated points. Also, a smooth, cubic interpolation would be a nice thing.
bezier
Another approach is to use Bézier curves; I used the following implementation:
source("bez.R")
xy.int.bz <- as.data.frame(with(coords, bezierCurve(x, y, N)))
plotInt(xy.int.bz)
How about regular splines using the same method you used for approx? Will that work on the larger data?
xy.int.sp <- as.data.frame(with(coords, list(x = spline(x)$y,
y = spline(y)$y)))
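If the roughly constant spacing asked about in the question matters, one possible extension of this spline idea (a sketch, not from the original answer; names are illustrative) is to parameterize the path by cumulative arc length and interpolate x and y against it:
# parameterize by cumulative arc length, then space points evenly along it
d <- c(0, cumsum(sqrt(diff(coords$x)^2 + diff(coords$y)^2)))
s <- seq(0, max(d), length.out = N)
xy.int.al <- data.frame(x = spline(d, coords$x, xout = s)$y,
                        y = spline(d, coords$y, xout = s)$y)
plotInt(xy.int.al)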
Consider using xspline or grid.xspline (the first is for base graphics, the second for grid):
# assuming x <- coords$x and y <- coords$y from the data above
plot(x, y, type = 'b', col = 'red')
xspline(x, y, shape = 1)
You can adjust the shape parameter to change the curve. This example just plots the x-spline, but you can also have the function return a set of xy coordinates that you would plot yourself.
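A minimal sketch of that coordinate-returning variant (assuming a graphics device is already open, as in the plot() call above):
sp <- xspline(x, y, shape = 1, draw = FALSE) # returns a list with components x and y
lines(sp$x, sp$y, col = 'blue')              # plot the returned coordinates yourself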
