Different behaviour lm in stat_smooth - r

In this question someone asked whether it is possible to change the colour of the points in a ggplot2 plot depending on which side of a linear regression line they fall.
The proposed solution worked: the points have a different colour above and below the line.
library(ggplot2)
set.seed(2015)
df <- data.frame(x = rnorm(100),
y = rnorm(100))
# Fit linear regression
l = lm(y ~ x, data = df)
# Make new group variable based on residuals
df$group = NA
df$group[which(l$residuals >= 0)] = "above"
df$group[which(l$residuals < 0)] = "below"
# Make the plot
ggplot(df, aes(x,y)) +
geom_point(aes(colour = group)) +
geom_smooth(method = "lm", formula = y ~ x)
But I would like to fit the regression for y - 1, as asked in this question.
# Fit linear regression
l = lm(y - 1 ~ x, data = df)
# Make new group variable based on residuals
df$group = NA
df$group[which(l$residuals >= 0)] = "above"
df$group[which(l$residuals < 0)] = "below"
# Make the plot
ggplot(df, aes(x,y)) +
geom_point(aes(colour = group)) +
geom_smooth(method = "lm", formula = y - 1 ~ x)
This is not what I expected. It looks to me as though stat_smooth did what I expected. The lm, however, gives the same result for y ~ x and y - 1 ~ x.
What am I missing here?

If you want to colour the points according to which side of the plotted line they lie on, you can compare the actual value to the predicted value rather than using the residual:
df$group = NA
df$group[df$y>predict(l)] = "above"
df$group[df$y<predict(l)] = "below"
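Why the residual-based grouping did not change can be checked directly. The following is a minimal sketch, reusing the simulated data from the question (the names `l0` and `l1` are only for illustration): subtracting a constant from the response shifts the intercept by that constant and leaves the slope and the residuals untouched, whereas the line drawn by geom_smooth is the prediction of y - 1 and therefore sits one unit lower.

```r
set.seed(2015)
df <- data.frame(x = rnorm(100), y = rnorm(100))

l0 <- lm(y ~ x, data = df)      # original model
l1 <- lm(y - 1 ~ x, data = df)  # response shifted down by 1

# Residuals are identical (up to floating point error)
same_resid <- all.equal(unname(residuals(l0)), unname(residuals(l1)))

# Only the intercept moves: the difference is 1 for the intercept, 0 for the slope
coef_shift <- unname(coef(l0) - coef(l1))

# Grouping by the plotted line (the prediction of y - 1) does change
above_line <- df$y > predict(l1)
above_resid <- residuals(l1) >= 0
```

Comparing `table(above_line)` with `table(above_resid)` shows how many points change group once the shifted line is used.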

Related

Zig zag lines instead of straight line in linear modeling

Dataset: Here
I am trying to fit a linear model on the above dataset using R.
Here is the code in R:
library(tidyverse)
data <- read.csv("~/Desktop/Salary_Data.csv")
s_data <- data.frame(scale(data))
# Split data into test and train data sets
set.seed(123)
sam <- sample(c(T, F), size = nrow(s_data), replace=T, prob = c(0.8,0.2))
train <- s_data[sam,]
test <- s_data[!sam,]
model_train = lm(YearsExperience~Salary, data=train);
pred <- predict.lm(object = model_train, newdata = test)
pred_train <- predict.lm(model_train, train)
# Trying to plot using ggplot on test dataset.
ggplot() +
geom_point(aes(x = test$YearsExperience, y = test$Salary),
colour = 'red') +
geom_line(aes(x = test$YearsExperience, y = predict.lm(model_train, test)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Output
My understanding is that the simple linear regression model predicts values based on a linear equation of the form ax+b. So y values in geom_line() must fit in a straight line, but in my case, they don't. Why is that happening? Thanks for reading!
It looks like you have simply flipped your x and y values. If you plot years of experience on the x axis, it looks like you are trying to use that to predict salary, but your model is the other way round: it predicts years of experience from salary. So you can flip the model and get a straight line:
model_train = lm(Salary~YearsExperience, data=train);
ggplot(data.frame(test, pred=predict(model_train, newdata = test))) +
geom_point(aes(x = YearsExperience, y = Salary),
colour = 'red') +
geom_line(aes(x = YearsExperience, y = pred),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Or you can flip the plot to get a straight line
model_train = lm(YearsExperience~Salary, data=train);
ggplot(data.frame(test, pred=predict(model_train, newdata = test))) +
geom_point(aes(x = Salary, y = YearsExperience),
colour = 'red') +
geom_line(aes(x = Salary, y = pred),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)')
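The zig-zag can be reproduced without the original file. The sketch below uses simulated data (the numbers are made up, since the salary dataset is not available here) and shows that predictions from the flipped model are a linear function of Salary, not of YearsExperience, so they only trace a straight line when plotted against Salary.

```r
set.seed(42)
d <- data.frame(YearsExperience = runif(30, min = 1, max = 10))
d$Salary <- 30000 + 9000 * d$YearsExperience + rnorm(30, sd = 5000)

correct <- lm(Salary ~ YearsExperience, data = d)  # predicts y from x
flipped <- lm(YearsExperience ~ Salary, data = d)  # predicts x from y

# Predictions of the correct model are exactly linear in YearsExperience
cor_correct <- cor(d$YearsExperience, predict(correct))

# Predictions of the flipped model are linear in Salary, which is noisy in x,
# so plotted against YearsExperience they wiggle
cor_flipped <- cor(d$YearsExperience, predict(flipped))
```

A correlation of exactly 1 means the points fall on a straight line; anything lower shows up as the zig-zag when geom_line connects the points in order of x.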

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size=length(y[X < 3]),prob=c(0.9, 0.1),
replace=TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it does mean, and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of facetting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (median does have a nice interpretation):
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth() +
scale_y_log10()
Finally, I know about geom_quantile but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data=NULL, aes(x=X, y=y)) +
geom_quantile(quantiles=c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth. All you need is a function that returns an object for which the predict generic works, that is, one whose predict method takes a data frame with a column called x and returns the corresponding values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
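rolling_median <- function(formula, data, n_roll = 11, ...) {
  # sort both vectors by x so the rolling window runs along the x axis
  ord <- order(data$x)
  x <- data$x[ord]
  y <- data$y[ord]
  # n_roll must be odd; fill = NA pads the ends (replaces the deprecated na.pad = TRUE)
  y <- zoo::rollmedian(y, n_roll, fill = NA)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
  # interpolate the rolling medians at the requested x values
  setNames(mod$f(newdata$x), newdata$x)
}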
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course, this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth() :
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit
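The same pattern can be tried outside ggplot first, which makes it easier to see what the fitted object returns. The sketch below is a variation on the idea above, not the answer's exact method: it swaps in base R's `stats::runmed` so that zoo is not needed, and the names `running_median` and `runmed_fit` are invented for this illustration.

```r
# A fitting function with the (formula, data, ...) signature used above
running_median <- function(formula, data, k = 11, ...) {
  ord <- order(data$x)
  x <- data$x[ord]
  y <- stats::runmed(data$y[ord], k = k)  # k must be odd
  structure(list(f = approxfun(x, y)), class = "runmed_fit")
}

# predict method: takes a data frame with a column x, returns smoothed y
predict.runmed_fit <- function(object, newdata, ...) {
  object$f(newdata$x)
}

set.seed(1)
X <- rnorm(500, mean = 5)
y <- rnorm(500)

fit <- running_median(y ~ x, data.frame(x = X, y = y))
p <- predict(fit, data.frame(x = c(4, 5, 6)))
```

Once predict returns sensible values here, the function can be handed to geom_smooth in the same way as rolling_median above, with se = FALSE.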
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
ys <- rep(NA, length(x)) # for the smoothed y values
xs <- setdiff(unique(x), NA) # the unique x values
i <- 1 # for testing
for (i in seq_along(xs)){
j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
ys[x == xs[i]] <- median(y[j], na.rm = TRUE) # y median over this window
}
y <- ys
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x, method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x, method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

Geom_smooth - linear regression through x-axis intercept

I would like to force a linear regression through a specific x-axis crossing point using "geom_smooth" in ggplot2:
geom_smooth(aes(x = x, y = y), method = "lm", formula = y ~ x)
Intuitively, choosing an x-axis intercept, one would use the formula y = a * (x - b) + c.
Implementing this in the "formula" code as e.g. :
geom_smooth(aes(x = x, y = y), method = "lm", formula = y ~ x - 5)
Does not work.
I am not sure it is possible to do this just using geom_smooth. However, you could predict the regression outside of your ggplot2 call, using an offset to set the intercept required and plot it subsequently.
For example:
set.seed(1)
# Generate some data
x <- 1:10
y <- 3 + 2*x + rnorm(length(x), 0, 2)
# Simple regression
z_1 <- lm(y ~ x)
# Regression with no intercept
z_2 <- lm(y ~ x + 0)
# Regression with intercept at (0,3) - the 'true' intercept
z_3 <- lm(y ~ x + 0, offset=rep(3, length(x)))
# See the coefficients
coef(z_1)
#(Intercept) x
# 2.662353 2.109464
coef(z_2)
# x
#2.4898
coef(z_3)
# x
#1.775515
# Combine into one dataframe
df <- cbind.data.frame(x,y,predict(z_1),predict(z_2), predict(z_3))
# Plot the three regression lines
library(ggplot2)
ggplot(df) + geom_point(aes(x,y)) +
geom_line(aes(x,predict(z_1)), color = "red") +
geom_line(aes(x,predict(z_2)), color = "blue") +
geom_line(aes(x,predict(z_3)), color = "green") +
scale_x_continuous(limits = c(0,10)) +
scale_y_continuous(limits = c(0,30))
You'll need an offset to lock in the x-intercept. It is passed via the method.args argument of geom_smooth, since not all smoothing methods accept that argument.
You'll also need to set the orientation argument so that the fixed value is an x-intercept rather than a y-intercept.
I also specified the number of points at which the smooth is evaluated (n) so that it matches the length of the offset vector; I'm not sure that is strictly necessary.
Some gymnastics to be sure, but hopefully it helps.
library("tidyverse")
mtcars %>%
ggplot(aes(disp, hp)) +
geom_point() +
geom_smooth(method = "lm",
orientation = "y",
formula = y ~ x + 0,
color= "blue",
se = FALSE,
n = nrow(mtcars),
method.args=list(offset=rep(100, nrow(mtcars))),
fullrange = TRUE) +
scale_x_continuous(limits =c(0, 600))
#> Warning: Removed 5 rows containing missing values (geom_smooth).
Created on 2020-07-08 by the reprex package (v0.3.0)
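As for why the formula in the question fails: inside a model formula, `-` removes terms rather than subtracting numbers, so arithmetic has to be wrapped in `I()`. A possible alternative, sketched here under the assumption that the line should cross the x-axis at x = 5 (i.e. c = 0 in y = a * (x - b) + c), is to regress on `I(x - 5)` without an intercept. The data are simulated for illustration only.

```r
set.seed(1)
x <- 1:10
y <- 2 * (x - 5) + rnorm(length(x), 0, 2)

# y = a * (x - 5): no intercept term, so the fitted line passes through (5, 0)
fit <- lm(y ~ I(x - 5) + 0)
at_crossing <- unname(predict(fit, newdata = data.frame(x = 5)))

# The same formula can be given to geom_smooth (only built if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x, y), aes(x, y)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ I(x - 5) + 0, se = FALSE)
}
```

This avoids the offset and orientation arguments, at the cost of hard-coding the crossing point in the formula.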

Multiple geom_smooth at differing thresholds

I would like to make a plot that has multiple geom_smooth(method="loess") lines for differing thresholds, but I'm having some issues.
Specifically, I want one geom_smooth() line for all points >1 standard deviation (SD) or < -1 SD (which includes the points beyond -/+2 SD), one for <-2 SD and >2 SD, and one with all the points together. However, I'm running into an issue where the smooth is only fitted to the data within each category (i.e. greater than 1 SD but less than 2 SD).
I have made some toy data here:
#test data
a <- c(rnorm(10000, mean=0, sd = 1))
b <- c(rnorm(10000, mean=0, sd = 1))
test <- as.data.frame(cbind(a,b))
test$Thresholds <- cut(test$a, breaks = c(-Inf,-2*sd(test$a),-sd(test$a),0,sd(test$a), 2*sd(test$a), Inf),
labels = c("2_SD+", "1_SD", "0_SD","0_SD", "1_SD", "2_SD+"))
plot <- ggplot(test, aes(x=b, y=a, color=Thresholds, alpha = 0.25, legend = F)) + geom_point() + geom_smooth(method="loess")
This creates the following plot:
Does anyone have any suggestions?
If you want smoothing done for different quantities of x and y you have to manipulate the data component...
library(ggplot2)
library(dplyr)
#test data
a <- c(rnorm(10000, mean=0, sd = 1))
b <- c(rnorm(10000, mean=0, sd = 1))
test <- as.data.frame(cbind(a,b))
test$Thresholds <- cut(test$a, breaks = c(-Inf,-2*sd(test$a),-sd(test$a),0,sd(test$a), 2*sd(test$a), Inf),
labels = c("2_SD+", "1_SD", "0_SD","0_SD", "1_SD", "2_SD+"))
ggplot(test, aes(x=b, y=a)) +
geom_point() +
# just 2
geom_smooth(data = test %>% filter(Thresholds == "2_SD+"), method="loess") +
# 1 and 2
geom_smooth(data = test %>% filter(Thresholds == "1_SD" | Thresholds == "2_SD+" ), method="loess", color = "yellow") +
#all
geom_smooth(data = test, method="loess", color = "red")
#> `geom_smooth()` using formula 'y ~ x'
#> `geom_smooth()` using formula 'y ~ x'
#> `geom_smooth()` using formula 'y ~ x'
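If a legend for the three smooths is wanted, one option is to stack the overlapping subsets into a single data frame with a label column and map that column to colour. This is only a sketch of that idea with a smaller simulated sample; the column name `subset` and the labels are invented here.

```r
set.seed(1)
test <- data.frame(a = rnorm(1000), b = rnorm(1000))
s <- sd(test$a)

# Each row may appear in several subsets, so the groups overlap by construction
overlapping <- rbind(
  cbind(test, subset = "all"),
  cbind(test[abs(test$a) > s, ], subset = "beyond 1 SD"),
  cbind(test[abs(test$a) > 2 * s, ], subset = "beyond 2 SD")
)
counts <- table(overlapping$subset)

# One smooth per label, with an automatic legend (only built if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(overlapping, aes(x = b, y = a, colour = subset)) +
    geom_smooth(method = "loess", formula = y ~ x)
}
```

The points themselves can still be drawn once from the original data by giving geom_point its own data argument.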

How to plot 3 models in one Figure in R?

I'm new with R and I have fit 3 models for my data as follows:
Model 1: y = a(x) + b
lm1 = lm(data$CBI ~ data$dNDVI)
Model 2: y = a(x)^2 + b(x) + c
lm2 <- lm(CBI ~ dNDVI + I(dNDVI^2), data=data)
Model 3: y = x(a|x| + b)–1
lm3 = nls(CBI ~ dNDVI*(a*abs(dNDVI) + b) - 1, start = c(a = 1.5, b = 2.7), data = data)
Now I would like to plot all three models in one figure in R, but I could not find a way to do it. Can you please help me? I have tried it with the first two models as follows, and it works, but I don't know how to add Model 3 to it:
ggplot(data = data, aes(x = dNDVI, y = CBI)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, size = 1, se = FALSE) +
geom_smooth(method = lm, formula = y ~ x + I(x^2), size = 1, se = FALSE ) +
theme_bw()
I would also like to add a legend that shows 3 different colours or types of lines/curves for the 3 models. Can you please guide me on how to do that in the figure?
Using iris as a dummy set to represent the three models:
library(ggplot2)
library(tidyr) # for gather()
# new data.frame on which to predict the fitted values for each model
new.dat <- data.frame(Sepal.Length = seq(min(iris$Sepal.Length),
                                         max(iris$Sepal.Length), length.out = 50))
m1 <- lm(Petal.Length ~ Sepal.Length, iris)
m2 <- lm(Petal.Length ~ Sepal.Length + I(Sepal.Length^2), data = iris)
m3 <- nls(Petal.Length ~ Sepal.Length*(a*abs(Sepal.Length) + b) - 1,
          start = c(a = 1.5, b = 2.7), data = iris)
new.dat$m1.fitted <- predict(m1, new.dat)
new.dat$m2.fitted <- predict(m2, new.dat)
new.dat$m3.fitted <- predict(m3, new.dat)
# stack the fitted values of the three models (so ggplot generates the legend automatically)
new.dat <- gather(new.dat, var, val, m1.fitted:m3.fitted)
ggplot(new.dat, aes(Sepal.Length, val, colour = var)) +
  geom_line()
