log linear model in ggplot? - r

The data is from: http://www.principlesofeconometrics.com/poe5/poe5rdata.html, in the file: collegetown.csv
A log linear model is of the form: ln(y) = b1 + b2x
library(ggthemes)
library(ggplot2)
theUrl <- "../poedata/collegetown.csv"
collegetown <- read.csv(theUrl)
g1 <- ggplot(data = collegetown, aes(x = sqft, y = price))+
geom_point(col = "blue")
plot(g1)
logLinearModel <- lm(log(price)~sqft, data = collegetown)
g1 + geom_smooth(method = "lm", formula = y ~ exp(x), se = F, col = "green")+
theme_economist()
summary(logLinearModel)
This gives me the weird plot below:
How do I plot the proper curve? Do I need to store the predicted values explicitly in the data frame?
PS: I want the axis to stay untransformed i.e. in their original scales.

The model y~exp(x) is not the same as the model log(y)~x, so you're not getting the smoother you expect. You can specify that the smoother is a generalised linear model with a log-link function using the code:
g1 <- ggplot(data = collegetown, aes(x = sqft, y = price))+
geom_point(col = "blue")
g1 + geom_smooth(method = "glm", formula = y ~ x, se = F, col = "green",
method.args = list(family=gaussian(link="log"))) +
theme_economist()
which gives what you're wanting. If that doesn't seem intuitive, you can fit the lm outside the plotting with:
logLinearModel <- lm(log(price)~sqft, data = collegetown)
collegetown$pred <- exp(predict(logLinearModel))
ggplot(data = collegetown, aes(x = sqft, y = price))+
geom_point(col = "blue") +
geom_line(aes(y=pred), col = "green")+
theme_economist()
Warning: the two versions are not the same if you want the standard errors. The first approach gives errors that are symmetric on the original scale, whereas the standard errors from the lm prediction are symmetric on the log scale. See here.
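To make the warning concrete, here is a minimal sketch on simulated data (the data frame sim and its coefficients are made up for illustration; it is not the collegetown file). It fits both versions and builds a confidence band for the lm version on the log scale before exponentiating it, which is one way to get a band that stays consistent with the log-linear model.

```r
# Simulated stand-in for collegetown (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(sqft = runif(200, 10, 80))
sim$price <- exp(4 + 0.03 * sim$sqft + rnorm(200, sd = 0.3))

# Version 1: Gaussian GLM with a log link, which models log(E[price])
fit_glm <- glm(price ~ sqft, family = gaussian(link = "log"), data = sim)

# Version 2: lm on log(price), which models E[log(price)]
fit_lm <- lm(log(price) ~ sqft, data = sim)

grid <- data.frame(sqft = seq(10, 80, length.out = 50))
grid$glm_fit <- predict(fit_glm, newdata = grid, type = "response")

# Confidence band computed on the log scale, then back-transformed
ci <- predict(fit_lm, newdata = grid, interval = "confidence")
grid$lm_fit <- exp(ci[, "fit"])
grid$lm_lwr <- exp(ci[, "lwr"])
grid$lm_upr <- exp(ci[, "upr"])

# Positive values mean the band reaches further above the curve than below it
asym <- (grid$lm_upr - grid$lm_fit) - (grid$lm_fit - grid$lm_lwr)

# Optional plot; skipped if ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(grid, aes(x = sqft)) +
    geom_point(data = sim, aes(y = price), col = "blue") +
    geom_ribbon(aes(ymin = lm_lwr, ymax = lm_upr), alpha = 0.3) +
    geom_line(aes(y = lm_fit), col = "green") +
    geom_line(aes(y = glm_fit), col = "red", linetype = "dashed")
}
```

The two fitted curves are usually close but not identical, because the GLM minimises squared error on the original scale while the lm minimises it on the log scale.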

I think a relatively simpler method to build the curve is using stat_function() method.
# LOG LINEAR MODEL
library(ggplot2)
library(ggdark) # provides dark_theme_bw()
logLinearModel <- lm(log(price)~sqft, data = collegetown)
smodloglinear <- summary(logLinearModel)
logLinearModel
names(logLinearModel)
yn <- exp(logLinearModel$fitted.values)
rgloglinear <- cor(yn, collegetown$price)
rgloglinear^2
# Intercept, slope and residual variance from the model summary
b1 <- coef(smodloglinear)[[1]]
b2 <- coef(smodloglinear)[[2]]
sighat2 <- smodloglinear$sigma^2
g2 <- ggplot(data = collegetown,aes(x = sqft, y = price))+
geom_point(col = "white") +
stat_function(fun = function(x){exp(b1+b2*x)}, aes(color = "red"))+
stat_function(fun = function(x){exp(b1+b2*x+sighat2/2)} , aes(color = "green"))+
dark_theme_bw()+
scale_color_identity(name = "Model fit",
breaks = c("red", "green"),
labels = c("yn", "yc"),
guide = "legend")
g2
which gives:
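The second curve adds sighat2/2 to the exponent. A short derivation of why, assuming the errors on the log scale are normally distributed (an assumption the code above does not state explicitly): exp(b1 + b2*x) estimates the conditional median of y, while the extra term turns it into an estimate of the conditional mean.

```latex
\begin{aligned}
\ln(y) &= b_1 + b_2 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \\
\hat{y}_n &= \exp(b_1 + b_2 x) \\
E[y \mid x] &= \exp(b_1 + b_2 x)\, E\!\left[e^{\varepsilon}\right]
             = \exp\!\left(b_1 + b_2 x + \tfrac{\sigma^2}{2}\right) \\
\hat{y}_c &= \exp\!\left(b_1 + b_2 x + \tfrac{\hat{\sigma}^2}{2}\right)
           = \hat{y}_n \, e^{\hat{\sigma}^2 / 2}
\end{aligned}
```

Since the correction factor is greater than 1 whenever the residual variance is positive, the yc curve always lies above the yn curve.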

Related

Zig zag lines instead of straight line in linear modeling

Dataset: Here
I am trying to fit a linear model on the above dataset using R.
Here is the code in R:
library(tidyverse)
data <- read.csv("~/Desktop/Salary_Data.csv")
s_data <- data.frame(scale(data))
# Split data into test and train data sets
set.seed(123)
sam <- sample(c(T, F), size = nrow(s_data), replace=T, prob = c(0.8,0.2))
train <- s_data[sam,]
test <- s_data[!sam,]
model_train = lm(YearsExperience~Salary, data=train);
pred <- predict.lm(object = model_train, newdata = test)
pred_train <- predict.lm(model_train, train)
# Trying to plot using ggplot on test dataset.
ggplot() +
geom_point(aes(x = test$YearsExperience, y = test$Salary),
colour = 'red') +
geom_line(aes(x = test$YearsExperience, y = predict.lm(model_train, test)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Output
My understanding is that the simple linear regression model predicts values based on a linear equation of the form ax+b. So y values in geom_line() must fit in a straight line, but in my case, they don't. Why is that happening? Thanks for reading!
It looks like you have your x and y values flipped. Plotting years of experience on the x axis implies you are using it to predict salary, but your model does the reverse: it predicts YearsExperience from Salary. The predictions are therefore linear in Salary, not in YearsExperience, which is why the line zig-zags. You can flip the model to get a straight line:
model_train = lm(Salary~YearsExperience, data=train);
ggplot(data.frame(test, pred=predict(model_train, newdata = test))) +
geom_point(aes(x = YearsExperience, y = Salary),
colour = 'red') +
geom_line(aes(x = YearsExperience, y = pred),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Or you can flip the plot to get a straight line:
model_train = lm(YearsExperience~Salary, data=train);
ggplot(data.frame(test, pred=predict(model_train, newdata = test))) +
geom_point(aes(x = Salary, y = YearsExperience),
colour = 'red') +
geom_line(aes(x = Salary, y = pred),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)')
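A quick way to check which variable the predictions are linear in, sketched on simulated data (the data frame d and its numbers are invented for illustration and are not the Salary_Data.csv file):

```r
# Hypothetical salary data, roughly linear in experience plus noise
set.seed(42)
d <- data.frame(YearsExperience = runif(30, 1, 10))
d$Salary <- 30000 + 9000 * d$YearsExperience + rnorm(30, sd = 8000)

backwards <- lm(YearsExperience ~ Salary, data = d)  # as in the question
forwards  <- lm(Salary ~ YearsExperience, data = d)  # flipped model

pred_back <- predict(backwards, newdata = d)  # predicted years
pred_fwd  <- predict(forwards, newdata = d)   # predicted salary

# Predictions form a straight line only against the model's own predictor,
# which shows up as a correlation of 1 (up to rounding error)
cor_fwd_x  <- cor(d$YearsExperience, pred_fwd)   # straight against YearsExperience
cor_back_x <- cor(d$YearsExperience, pred_back)  # not straight, hence the zig-zag
cor_back_y <- cor(d$Salary, pred_back)           # straight against Salary
```

geom_line() connects the points in order of x, so any scatter of the predictions around a straight line in x is drawn as a zig-zag.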

Plotting different models for different x value ranges in ggplot()

I am attempting to display a linear model for low x values and a non-linear model for higher x values. To do this, I will use DNase as an example:
library(ggplot2)
#Assigning DNase as a new dataframe:
data_1 <- DNase
#Creating a column that can distinguish low and high range values:
data_1$range <- ifelse(data_1$conc <5, "low", "high")
#Attempting to plot separate lines for low and high range values, and also facet_wrap by run:
ggplot(data_1, aes(x = conc, y = density, colour = range)) +
geom_point(size = 0.5) + stat_smooth(method = "nls",
method.args = list(formula = y ~ a*exp(b*x),
start = list(a = 0.8, b = 0.1)),
data = data_1,
se = FALSE) +
stat_smooth(method = 'lm', formula = 'y~0+x') +
facet_wrap(~Run)
However, as you can see, it seems to plot both the linear model and the non-linear model for both, and I can't quite figure out where to put information that would tell it to only plot one for each. Also, if possible, can I extend these models out to the full range of values on the x axis?
You can provide specific data to each geom. In this case, subset data_1 by range so that each stat_smooth() call only receives the relevant rows, and pass the whole data frame to geom_point():
ggplot(NULL, aes(x = conc, y = density, colour = range)) +
geom_point(data = data_1, size = 0.5) +
stat_smooth(data = subset(data_1, range == "high"),
method = "nls",
method.args = list(formula = y ~ a*exp(b*x),
start = list(a = 0.8, b = 0.1)),
se = FALSE) +
stat_smooth(data = subset(data_1, range == "low"), method = 'lm', formula = 'y~0+x') +
facet_wrap(~Run)
If you want to fit both models on all the data, then just calculate those manually in data_1 and plot manually.

Nonparametric regression ggplot

I'm trying to plot some nonparametric regression curves with ggplot2. I achieved it with the base plot() function:
library(KernSmooth)
set.seed(1995)
X <- runif(100, -1, 1)
G <- X[which (X > 0)]
L <- X[which (X < 0)]
u <- rnorm(100, 0 , 0.02)
Y <- -exp(-20*L^2)-exp(-20*G^2)/(X+1)+u
m <- lm(Y~X)
plot(Y~X)
abline(m, col="red")
m2 <- locpoly(X, Y, bandwidth = 0.05, degree = 0)
lines(m2$x, m2$y, col = "red")
m3 <- locpoly(X, Y, bandwidth = 0.15, degree = 0)
lines(m3$x, m3$y, col = "black")
m4 <- locpoly(X, Y, bandwidth = 0.3, degree = 0)
lines(m4$x, m4$y, col = "green")
legend("bottomright", legend = c("NW(bw=0.05)", "NW(bw=0.15)", "NW(bw=0.3)"),
lty = 1, col = c("red", "black", "green"), cex = 0.5)
With ggplot2 I have managed to plot the linear regression:
With this code:
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25))
But I don't know how to add the other lines to this plot, as in the base R plot. Any suggestions? Thanks in advance.
Solution
The shortest solution (though not the most beautiful one) is to add the lines using the data= argument of the geom_line function:
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25)) +
geom_line(data = as.data.frame(m2), mapping = aes(x=x,y=y))
Beautiful solution
To get beautiful colors and legend, use
# Need to convert lists to data.frames, ggplot2 needs data.frames
m2 <- as.data.frame(m2)
m3 <- as.data.frame(m3)
m4 <- as.data.frame(m4)
# Colnames are used as names in the ggplot legend. There's nothing wrong with using
# column names which contain symbols or whitespace, you just have to use
# backticks, e.g. m2$`NW(bw=0.05)` if you want to work with them
colnames(m2) <- c("x","NW(bw=0.05)")
colnames(m3) <- c("x","NW(bw=0.15)")
colnames(m4) <- c("x","NW(bw=0.3)")
# To give the different kernel regression estimates different colors, they must all be in one data frame.
# For merging to work, all x columns of m2-m4 must be the same!
# The merge function will automatically detect columns of the same name
# (that is, x) in m2-m4 and use it to identify y values which belong
# together (to the same x value)
mm <- Reduce(x=list(m2,m3,m4), f=function(a,b) merge(a,b))
# The above line is the same as:
# mm <- merge(m2,m3)
# mm <- merge(mm,m4)
# ggplot needs data in long (tidy) format
mm <- tidyr::gather(mm, kernel, y, -x)
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25)) +
geom_line(data = mm, mapping = aes(x=x,y=y,color=kernel))
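As an alternative to renaming, merging and gathering, the long data frame can be built directly by looping over the bandwidths. This is only a sketch: the simulated Y below is a made-up signal rather than the one from the question, and the names bandwidths and fits are introduced here for illustration.

```r
library(KernSmooth)

# Hypothetical data: a smooth signal plus noise
set.seed(1995)
X <- runif(100, -1, 1)
Y <- sin(3 * X) + rnorm(100, 0, 0.1)

bandwidths <- c(0.05, 0.15, 0.3)

# One Nadaraya-Watson fit (degree = 0) per bandwidth, stacked in long format
fits <- do.call(rbind, lapply(bandwidths, function(bw) {
  fit <- locpoly(X, Y, bandwidth = bw, degree = 0)
  data.frame(x = fit$x, y = fit$y, kernel = sprintf("NW(bw=%s)", bw))
}))

# Optional plot; skipped if ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(X = X, Y = Y), aes(x = X, y = Y)) +
    geom_point(shape = 1) +
    geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
    geom_line(data = fits, mapping = aes(x = x, y = y, color = kernel))
}
```

Because every fit is stacked with its own label, this version does not rely on the x grids of the individual fits being identical.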
Solution which will settle this for everyone and for eternity
The most beautiful and reproducible way, though, would be to create a custom stat in ggplot2 (see the stats included in ggplot2).
The ggplot2 team has written a vignette on this topic: Extending ggplot2. I have never undertaken such a heroic endeavour though.

How to plot a linear and quadratic model on the same graph?

So I have 2 models for the data set that I am using:
> Bears1Fit1 <- lm(Weight ~ Neck.G)
>
> Bears2Fit2 <- lm(Weight ~ Neck.G + I(Neck.G)^2)
I want to plot these two models on the same scatterplot. I have this so far:
> plot(Neck.G, Weight, pch = c(1), main = "Black Bears Data: Weight Vs Neck Girth", xlab = "Neck Girth (inches) ", ylab = "Weight (pounds)")
> abline(Bears1Fit1)
However, I am unsure of how I should put the quadratic model on the same graph as well. I want to be able to have both lines on the same graph.
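Here is an example with the cars data set. Note that the quadratic term has to be written with the exponent inside I(), as in I(dist^2); I(Neck.G)^2 in the question does not square the variable.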
data(cars)
make models:
model_lm <- lm(speed ~ dist, data = cars)
model_lm2 <- lm(speed ~ dist + I(dist^2), data = cars)
make new data:
new.data <- data.frame(dist = seq(from = min(cars$dist),
to = max(cars$dist), length.out = 200))
predict:
pred_lm <- predict(model_lm, newdata = new.data)
pred_lm2 <- predict(model_lm2, newdata = new.data)
plot:
plot(speed ~ dist, data = cars)
lines(pred_lm ~ new.data$dist, col = "red")
lines(pred_lm2 ~ new.data$dist, col = "blue")
legend("topleft", c("linear", "quadratic"), col = c("red", "blue"), lty = 1)
with ggplot2
library(ggplot2)
put all data in one data frame and convert to long format using melt from reshape2
preds <- data.frame(new.data,
linear = pred_lm,
quadratic = pred_lm2)
preds <- reshape2::melt(preds,
id.vars = 1)
plot
ggplot(data = preds)+
geom_line(aes(x = dist, y = value, color = variable ))+
geom_point(data = cars, aes(x = dist, y = speed))+
theme_bw()
EDIT: another way using just ggplot2 using two geom_smooth layers, one with the default formula y ~ x (so it need not be specified) and one with a quadratic model formula = y ~ x + I(x^2). In order to get a legend we can specify color within the aes call naming the desired entry as we want it to show in the legend.
ggplot(cars,
aes(x = dist, y = speed)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = "linear"),
se = FALSE) +
geom_smooth(method = "lm",
formula = y ~ x + I(x^2),
aes(color = "quadratic"),
se = FALSE) +
theme_bw()

How do I get the equation for a regression line in log-log plot in ggplot2?

I have a log-log plot, and I got the regression line by using:
geom_smooth(formula = y ~ x, method='lm')
But now I'd like to obtain the equation of this line (e.g. y=a*x^(-b)) and print it. I managed to get it in a lin-lin plot but not in this case.
Here's the code:
mydataS<-data.frame(DurPeak_h[],IntPeak[],IntPeakxDurPeak[],ID[]) #df peak
names(mydataS)<-c("x","y","ID","IDEVENT")
plotID<-ggplot(mydataS, aes(x=x, y=y, label=IDEVENT)) +
geom_text(check_overlap = TRUE, hjust = 0, nudge_x = 0.02)+
geom_point(colour="black", size = 2) + geom_point(aes(colour = ID)) +
geom_quantile(quantiles = qs, colour="green")+
scale_colour_gradient(low = "white", high="red") +
scale_x_log10(limits = c(min(DurEnd_h),max(DurEnd_h))) +
scale_y_log10(limits = c(min(IntEnd),max(IntEnd))) +
geom_smooth(formula = y ~ x, method='lm')
ggsave(height=7,"plot.pdf")
mydataS<-data.frame(DurPeak_h[],IntPeak[],IntPeakxDurPeak[],ID[])
names(mydataS)<-c("x","y","ID","IDEVENT")
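# Fit on the log10 scale, which is what geom_smooth() sees after scale_x_log10()/scale_y_log10()
model <- lm(log10(y) ~ log10(x), data = mydataS)
summary(model)
The fitted line is log10(y) = intercept + slope*log10(x). Written as y = a*x^b, a is 10^intercept and b is the slope.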
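A minimal sketch of recovering the power-law equation from a linear fit on the log10 scale. The data are simulated with a known a = 30 and b = -0.5, so the data frame sim and the label format are assumptions for illustration only.

```r
# Hypothetical power-law data: y = 30 * x^(-0.5) with multiplicative noise
set.seed(7)
sim <- data.frame(x = runif(100, 0.5, 50))
sim$y <- 30 * sim$x^(-0.5) * exp(rnorm(100, sd = 0.1))

# Straight line on the log-log scale
fit <- lm(log10(y) ~ log10(x), data = sim)

# Back-transform: log10(y) = intercept + slope * log10(x)  =>  y = a * x^b
a <- 10^coef(fit)[[1]]
b <- coef(fit)[[2]]

# Text label that could be passed to annotate("text", ..., label = eqn)
eqn <- sprintf("y = %.2f * x^(%.2f)", a, b)
```

Note that this fit minimises squared error in log10(y), so its coefficients will generally differ a little from those of an nls() fit on the original scale.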
Did it with a workaround: using nls to calculate the two parameters a and b, precisely:
nlsPeak <- coef(nls(y ~ a*(x)^b, data = mydataS, start = list(a=30, b=-0.1)))
then plotting the line with annotate (see some examples here) and finally printing the equation using the function:
power_eqn = function(ds){
m = nls(y ~ a*x^b, start = list(a=30, b=-0.1), data = ds);
eq <- substitute(italic(y) == a ~italic(x)^b,
list(a = format(coef(m)[1], digits = 4),
b = format(coef(m)[2], digits = 2)))
as.character(as.expression(eq));
}
called as follows:
annotate("text",x = 3, y = 180,label = power_eqn(mydataS), parse=TRUE, col="black") +
Hope it helps!

Resources