Plotting different models for different x value ranges in ggplot() - r

I am attempting to display a linear model for low x values and a non-linear model for higher x values. To do this, I will use DNase as an example:
library(ggplot2)
#Assigning DNase as a new dataframe:
data_1 <- DNase
#Creating a column that can distinguish low and high range values:
data_1$range <- ifelse(data_1$conc <5, "low", "high")
#Attempting to plot separate lines for low and high range values, and also facet_wrap by run:
ggplot(data_1, aes(x = conc, y = density, colour = range)) +
geom_point(size = 0.5) + stat_smooth(method = "nls",
method.args = list(formula = y ~ a*exp(b*x),
start = list(a = 0.8, b = 0.1)),
data = data_1,
se = FALSE) +
stat_smooth(method = 'lm', formula = 'y~0+x') +
facet_wrap(~Run)
However, as you can see, it seems to plot both the linear model and the non-linear model for both, and I can't quite figure out where to put information that would tell it to only plot one for each. Also, if possible, can I extend these models out to the full range of values on the x axis?

You can provide specific data to each geom. In this case, subset data_1 by range so that each stat_smooth() call receives only the relevant rows, and pass the whole data frame to geom_point():
ggplot(NULL, aes(x = conc, y = density, colour = range)) +
geom_point(data = data_1, size = 0.5) +
stat_smooth(data = subset(data_1, range == "high"),
method = "nls",
method.args = list(formula = y ~ a*exp(b*x),
start = list(a = 0.8, b = 0.1)),
se = FALSE) +
stat_smooth(data = subset(data_1, range == "low"), method = 'lm', formula = 'y~0+x') +
facet_wrap(~Run)
If you want to fit both models on all the data, then just calculate those manually in data_1 and plot manually.
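One way to do that manual calculation is sketched below. It assumes each model is fitted on its own range, pooled over all runs rather than per Run, and then predicted over the full range of conc. The names fit_low, fit_high, grid and preds are illustrative and do not come from the original answer.

```r
library(ggplot2)

data_1 <- DNase
data_1$range <- ifelse(data_1$conc < 5, "low", "high")

# Fit each model on its own subset (pooled over all runs; an assumption of this sketch)
fit_low  <- lm(density ~ 0 + conc, data = subset(data_1, range == "low"))
fit_high <- nls(density ~ a * exp(b * conc),
                data  = subset(data_1, range == "high"),
                start = list(a = 0.8, b = 0.1))

# Predict both models over the full range of conc
grid  <- data.frame(conc = seq(min(data_1$conc), max(data_1$conc), length.out = 100))
preds <- rbind(
  data.frame(grid, density = predict(fit_low,  newdata = grid), range = "low"),
  data.frame(grid, density = predict(fit_high, newdata = grid), range = "high")
)

# Points from the raw data, lines from the precomputed predictions
p <- ggplot(data_1, aes(x = conc, y = density, colour = range)) +
  geom_point(size = 0.5) +
  geom_line(data = preds)
p
```

To get per-run curves with facet_wrap(~Run), the same fitting and prediction steps would have to be repeated for each level of Run, with Run kept as a column in preds.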

Related

Zig zag lines instead of straight line in linear modeling

Dataset: Here
I am trying to fit a linear model on the above dataset using R.
Here is the code in R:
library(tidyverse)
data <- read.csv("~/Desktop/Salary_Data.csv")
s_data <- data.frame(scale(data))
# Split data into test and train data sets
set.seed(123)
sam <- sample(c(T, F), size = nrow(s_data), replace=T, prob = c(0.8,0.2))
train <- s_data[sam,]
test <- s_data[!sam,]
model_train = lm(YearsExperience~Salary, data=train);
pred <- predict.lm(object = model_train, newdata = test)
pred_train <- predict.lm(model_train, train)
# Trying to plot using ggplot on test dataset.
ggplot() +
geom_point(aes(x = test$YearsExperience, y = test$Salary),
colour = 'red') +
geom_line(aes(x = test$YearsExperience, y = predict.lm(model_train, test)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Output
My understanding is that the simple linear regression model predicts values based on a linear equation of the form ax+b. So y values in geom_line() must fit in a straight line, but in my case, they don't. Why is that happening? Thanks for reading!
It looks like your x and y variables are flipped. Plotting years of experience on the x axis implies that you are using it to predict salary, but your model does the opposite: it predicts YearsExperience from Salary, so the predicted values do not fall on a straight line when drawn against YearsExperience. You can flip the model to get a straight line:
model_train = lm(Salary~YearsExperience, data=train);
ggplot(data.frame(test, pred=predict(model_train, newdata = test))) +
geom_point(aes(x = YearsExperience, y = Salary),
colour = 'red') +
geom_line(aes(x = YearsExperience, y = pred),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Or you can flip the plot to get a straight line
model_train = lm(YearsExperience~Salary, data=train);
ggplot(data.frame(test, pred=predict(model_train, newdata = test))) +
geom_point(aes(x = Salary, y = YearsExperience),
colour = 'red') +
geom_line(aes(x = Salary, y = pred),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)')
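To see why the original plot zig-zags, here is a minimal sketch using the built-in cars data as a stand-in for the salary data (which is not reproduced here). Fitted values from lm(y ~ x) are an exact linear function of x, whereas fitted values from the reversed model are a linear function of y, so they scatter when drawn against x.

```r
# cars stands in for the salary data: speed plays the role of x, dist of y
forward  <- lm(dist ~ speed, data = cars)   # predicts y from x
reversed <- lm(speed ~ dist, data = cars)   # predicts x from y

# Fitted values of the forward model lie exactly on a line in speed
cor_forward <- cor(cars$speed, fitted(forward))

# Fitted values of the reversed model are linear in dist, not in speed,
# so joining them in order of speed gives a jagged line
cor_reversed <- cor(cars$speed, fitted(reversed))

c(forward = cor_forward, reversed = cor_reversed)
```

A correlation of exactly 1 means the points are collinear; anything lower means a line drawn through them in x order will bend.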

log linear model in ggplot?

The data is from: http://www.principlesofeconometrics.com/poe5/poe5rdata.html, in the file: collegetown.csv
A log linear model is of the form: ln(y) = b1 + b2x
library(ggthemes)
library(ggplot2)
theUrl <- "../poedata/collegetown.csv"
collegetown <- read.csv(theUrl)
g1 <- ggplot(data = collegetown, aes(x = sqft, y = price))+
geom_point(col = "blue")
plot(g1)
logLinearModel <- lm(log(price)~sqft, data = collegetown)
g1 + geom_smooth(method = "lm", formula = y ~ exp(x), se = F, col = "green")+
theme_economist()
summary(logLinearModel)
This gives me the weird plot below:
How do I plot the proper curve? Do I need to store the predicted values explicitly in the data frame?
PS: I want the axis to stay untransformed i.e. in their original scales.
The model y~exp(x) is not the same as the model log(y)~x, so you're not getting the smoother you expect. You can specify that the smoother is a generalised linear model with a log-link function using the code:
g1 <- ggplot(data = collegetown, aes(x = sqft, y = price))+
geom_point(col = "blue")
g1 + geom_smooth(method = "glm", formula = y ~ x, se = F, col = "green",
method.args = list(family=gaussian(link="log"))) +
theme_economist()
which gives what you're wanting. If that doesn't seem intuitive, you can fit the lm outside the plotting with:
logLinearModel <- lm(log(price)~sqft, data = collegetown)
collegetown$pred <- exp(predict(logLinearModel))
ggplot(data = collegetown, aes(x = sqft, y = price))+
geom_point(col = "blue") +
geom_line(aes(y=pred), col = "green")+
theme_economist()
Warning: the two versions are not the same if you want standard errors. The first approach gives errors that are symmetric on the original scale, whereas the standard errors you get from the lm prediction are symmetric on the log scale. See here.
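The two approaches also target slightly different quantities: the log-link GLM models log(E[y]) with additive error on the original scale, while lm(log(y) ~ x) models E[log(y)]. A small sketch with the built-in mtcars data (an assumption, since collegetown.csv does not ship with R) suggests the coefficients come out close but not identical.

```r
# Log-transformed response, ordinary least squares on the log scale
fit_lm  <- lm(log(mpg) ~ wt, data = mtcars)

# Untransformed response, Gaussian GLM with a log link
fit_glm <- glm(mpg ~ wt, family = gaussian(link = "log"), data = mtcars)

# Compare intercepts and slopes side by side
rbind(lm_log_response = coef(fit_lm), glm_log_link = coef(fit_glm))
```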
I think a simpler way to build the curve is to use stat_function().
# LOG LINEAR MODEL
logLinearModel <- lm(log(price)~sqft, data = collegetown)
smodloglinear <- summary(logLinearModel)
logLinearModel
names(logLinearModel)
yn <- exp(logLinearModel$fitted.values)
rgloglinear <- cor(yn, collegetown$price)
rgloglinear^2
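# Intercept, slope and residual variance, taken from the summary object created above
b1 <- coef(smodloglinear)[[1]]
b2 <- coef(smodloglinear)[[2]]
sighat2 <- smodloglinear$sigma^2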
g2 <- ggplot(data = collegetown,aes(x = sqft, y = price))+
geom_point(col = "white") +
stat_function(fun = function(x){exp(b1+b2*x)}, aes(color = "red"))+
stat_function(fun = function(x){exp(b1+b2*x+sighat2/2)} , aes(color = "green"))+
dark_theme_bw()+ # from the ggdark package; load it with library(ggdark)
scale_color_identity(name = "Model fit",
breaks = c("red", "green"),
labels = c("yn", "yc"),
guide = "legend")
g2
which gives:
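The second curve (labelled yc) applies the usual log-normal correction. As a short sketch of where the extra term comes from, assuming the errors of the log-linear model are normal with variance \sigma^2 (estimated by sighat2 in the code):

```latex
\ln y = b_1 + b_2 x + e, \quad e \sim N(0, \sigma^2)
\;\Longrightarrow\;
E[y \mid x] = \exp\!\left(b_1 + b_2 x + \tfrac{\sigma^2}{2}\right)
```

Under the same assumption, exp(b1 + b2*x) without the correction is the conditional median of y rather than its mean.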

Fit and plot a Weibull model to a survival data

I want to achieve the exact same thing asked in this question:
How to plot the survival curve generated by survreg (package survival of R)?
Except for the fact that I don't want the data to be stratified by a variable (in the question above it was stratified by sex).
I just want the progression free survival for the whole group of treated patients.
So when I copy the code from the other question, here is where I get stuck:
library(survminer)
library(tidyr)
s <- with(lung,Surv(time,status))
fKM <- survfit(s ~ sex,data=lung)
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung) # in my case here I would replace as.factor(sex) by 1
pred.sex1 = predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01)) #Since I don't want to stratify, what do I do with these 2 lines of code?
pred.sex2 = predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01))
df = data.frame(y=seq(.99,.01,by=-.01), sex1=pred.sex1, sex2=pred.sex2)
df_long = gather(df, key= "sex", value="time", -y)
p = ggsurvplot(fKM, data = lung, risk.table = T)
p$plot = p$plot + geom_line(data=df_long, aes(x=time, y=y, group=sex))
I tried replacing as.factor(sex) by 1 and then the rest of the code just does not make sense, can someone help me with this?
Many thanks in advance!
If you just want to plot the overall empirical survival curve, you might do something like this:
library(survival)
library(survminer)
library(tidyr)
s <- with(lung, Surv(time, status))
fKM <- survfit(s ~ 1, data = survival::lung)
ggsurvplot(fKM, ggtheme = theme_bw())
However, if you want to fit a Weibull model with no predictors, then your formula is fine.
sWei <- survreg(s ~ 1, dist = 'weibull', data = lung)
probs <- seq(0.01, 0.99, by = 0.01) # stop short of 1: the Weibull quantile at p = 1 is infinite
time <- predict(sWei, type = "quantile", se = TRUE, p = probs)
The only problem is that time is now a named list of two matrices: fit and se.fit. Both have the same number of rows as lung, but all rows are identical, so we just take one from each and calculate the confidence interval in a data frame which we can then use to create a ggplot:
ggplot(data = data.frame(p = 1 - probs,
time = time$fit[1,],
upper = time$fit[1,] + 1.96 * time$se.fit[1,],
lower = time$fit[1,] - 1.96 * time$se.fit[1,])) +
geom_step(aes(p, time, colour = "All"), size = 1) +
geom_ribbon(aes(p, ymin = lower, ymax = upper, fill = "All"), alpha = 0.2) +
coord_flip(ylim = c(0, 1000)) +
scale_fill_discrete(name = "Strata") +
scale_color_discrete(name = "Strata") +
theme_bw() +
theme(legend.position = "top")
Which we can see looks like a pretty good fit.
If you want both in the same plot you can do something like:
df <- data.frame(p = 1 - probs,
time = time$fit[1,],
upper = time$fit[1,] + 1.96 * time$se.fit[1,],
lower = time$fit[1,] - 1.96 * time$se.fit[1,])
ggsurvplot(fKM, ggtheme = theme_bw())$plot +
geom_line(data = df, aes(time, p), linetype = 2, size = 1) +
geom_line(data = df, aes(upper, p), linetype = 2, size = 1) +
geom_line(data = df, aes(lower, p), linetype = 2, size = 1)
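As a sanity check on what predict() returns for the intercept-only Weibull model, the sketch below confirms that all rows of the fit matrix are identical and that the quantiles match base R's qweibull() under the conversion shape = 1/scale and scale = exp(intercept) described in ?survreg. It uses 0.99 as the largest probability to avoid an infinite quantile; the names q_hat, q_ref and rows_identical are illustrative.

```r
library(survival)

sWei  <- survreg(Surv(time, status) ~ 1, dist = "weibull", data = lung)
probs <- seq(0.01, 0.99, by = 0.01)
time  <- predict(sWei, type = "quantile", se.fit = TRUE, p = probs)

# A list holding two matrices, one row per observation in lung
str(time, max.level = 1)

# With no predictors every row carries the same quantiles
rows_identical <- all(apply(time$fit, 2, function(col) diff(range(col)) < 1e-8))

# Same quantiles from the Weibull distribution in base R's parameterisation
q_hat <- as.numeric(time$fit[1, ])
q_ref <- qweibull(probs, shape = 1 / sWei$scale, scale = exp(coef(sWei)[[1]]))
all.equal(q_hat, q_ref)
```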
Created on 2020-08-18 by the reprex package (v0.3.0)

How to plot a linear and quadratic model on the same graph?

So I have 2 models for the data set that I am using:
> Bears1Fit1 <- lm(Weight ~ Neck.G)
>
> Bears2Fit2 <- lm(Weight ~ Neck.G + I(Neck.G^2))
I want to plot these two models on the same scatterplot. I have this so far:
> plot(Neck.G, Weight, pch = c(1), main = "Black Bears Data: Weight Vs Neck Girth", xlab = "Neck Girth (inches) ", ylab = "Weight (pounds)")
> abline(Bears1Fit1)
However, I am unsure of how I should put the quadratic model on the same graph as well. I want to be able to have both lines on the same graph.
Here is an example with cars data set:
data(cars)
make models:
model_lm <- lm(speed ~ dist, data = cars)
model_lm2 <- lm(speed ~ dist + I(dist^2), data = cars)
make new data:
new.data <- data.frame(dist = seq(from = min(cars$dist),
to = max(cars$dist), length.out = 200))
predict:
pred_lm <- predict(model_lm, newdata = new.data)
pred_lm2 <- predict(model_lm2, newdata = new.data)
plot:
plot(speed ~ dist, data = cars)
lines(pred_lm ~ new.data$dist, col = "red")
lines(pred_lm2 ~ new.data$dist, col = "blue")
legend("topleft", c("linear", "quadratic"), col = c("red", "blue"), lty = 1)
with ggplot2
library(ggplot2)
put all data in one data frame and convert to long format using melt from reshape2
preds <- data.frame(new.data,
linear = pred_lm,
quadratic = pred_lm2)
preds <- reshape2::melt(preds,
id.vars = 1)
plot
ggplot(data = preds)+
geom_line(aes(x = dist, y = value, color = variable ))+
geom_point(data = cars, aes(x = dist, y = speed))+
theme_bw()
EDIT: another way, using only ggplot2, is to add two geom_smooth layers: one with the default formula y ~ x (so it need not be specified) and one with the quadratic formula y ~ x + I(x^2). To get a legend, set color inside the aes call, using the label you want shown in the legend as its value.
ggplot(cars,
aes(x = dist, y = speed)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = "linear"),
se = FALSE) +
geom_smooth(method = "lm",
formula = y ~ x + I(x^2),
aes(color = "quadratic"),
se = FALSE) +
theme_bw()
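A caveat on writing the quadratic term: the power has to go inside I(). In a formula, ^ outside I() is read as a formula operator rather than arithmetic, so I(dist)^2 collapses to the plain term I(dist) and no squared term is fitted. A minimal check on cars (the names good and bad are illustrative):

```r
good <- lm(speed ~ dist + I(dist^2), data = cars)  # genuine quadratic term
bad  <- lm(speed ~ dist + I(dist)^2, data = cars)  # ^2 acts on the formula term, not the values

coef(good)
coef(bad)   # the I(dist) column duplicates dist, so its coefficient is NA
```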

Use ggplot to plot partial effects obtained with effects library

I would like to use ggplot to replicate the partial-effect plots (with partial residuals) produced by the "effects" package. To do this I need to retrieve some information.
This is the plot I want to replicate with ggplot.
library(effects)
mod <- lm(log(prestige) ~ income:type + education, data=Prestige)
eff = effect("education", mod, partial.residuals=T)
plot(eff)
From the eff object I am able to retrieve the partial residuals, as eff$residuals, but they are not sufficient to replicate the plot. I think that what I need is both the residuals and the marginal predicted effect. However, I was not able to retrieve them from my eff object.
Otherwise I only have the residuals scores that cannot be plotted against the line of the marginal effect.
Any hint on how to retrieve this information?
You have almost all the information available. It would take more time to generalize, but here is some code that produces a figure approximately like the one from the effects package. Notice that the smoother is off, but I didn't bother to dig into why.
The code should be self explanatory. I only copied function closest from the package.
mod <- lm(log(prestige) ~ income:type + education, data=Prestige)
eff = effect("education", mod, partial.residuals=T)
library(ggplot2)
library(gridExtra)
closest <- function(x, x0) apply(outer(x, x0, FUN=function(x, x0) abs(x - x0)), 1, which.min)
x.fit <- unlist(eff$x.all)
trans <- I
x <- data.frame(lower = eff$lower, upper = eff$upper, fit = eff$fit, education = eff$x$education)
xy <- data.frame(x = x.fit, y = x$fit[closest(trans(x.fit), x$education)] + eff$residuals)
g <- ggplot(x, aes(x = education, y = fit)) +
theme_bw() +
geom_line(size = 1) +
geom_point(data = xy, aes(x = x, y = y), shape = 1, col = "blue", size = 2) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.5) +
geom_smooth(data = xy, aes(x = trans(x), y = y),
method = "loess", span = 2/3, linetype = "dashed", se = FALSE)
grid.arrange(plot(eff), g, ncol = 2)
