R: Adding regression line and R^2 in QQplot

I'm currently building a regression model to predict various financial variables. After collecting the data, I'm now using QQ plots to look for differences in the distributions of the independent variables. The attached picture is an example of a result obtained with a QQ plot.
Now I need to add a regression line and calculate the R^2 to determine which data transformation (log, quadratic, ...) I should apply.
This is the line that creates the QQ plot:
qqplot(Xvar, Yvar, plot.it = TRUE, xlab = "Xvar", ylab = "Yvar", xlim = c(-5, 30),
       ylim = c(0, 100000000), main = "QQPlot")
Can you, please, help me with this?
Thank you!
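One possible approach, sketched with simulated stand-in data (the vectors xvar and yvar and their distributions are assumptions, not the asker's data): qqplot() invisibly returns the sorted quantile pairs it plots, so a straight line can be fitted to those pairs with lm() and its R^2 read from summary().

```r
set.seed(1)
# Hypothetical stand-ins for the two variables being compared
xvar <- rnorm(200, mean = 10, sd = 5)
yvar <- rlnorm(200, meanlog = 2, sdlog = 0.5)

# qqplot() returns the plotted quantile pairs as a list with elements x and y
qq <- qqplot(xvar, yvar, xlab = "xvar quantiles", ylab = "yvar quantiles",
             main = "QQ plot")

# Fit a straight line through the quantile pairs and draw it on the plot
qq_fit <- lm(y ~ x, data = data.frame(x = qq$x, y = qq$y))
abline(qq_fit, col = "red")

# R^2 of the line through the quantile pairs
r2 <- summary(qq_fit)$r.squared

# The same measure after a log transform of yvar, without drawing a new plot
qq_log <- qqplot(xvar, log(yvar), plot.it = FALSE)
r2_log <- summary(lm(y ~ x, data = data.frame(x = qq_log$x, y = qq_log$y)))$r.squared
```

Comparing r2 with r2_log (or with the value for any other candidate transformation) gives a rough indication of which transformation makes the quantile relationship closest to linear.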

Related

Can one use lm() to hold other variables at their mean and plot the regression line on a bivariate scatter plot?

I'm just wondering if it is possible to plot the regression line for a particular variable from a multivariate regression (using lm()), while holding other variables at their mean, and fit the regression line on a bivariate scatter plot?
There's a function called cplot() in the margins package that will do the trick, but it doesn't seem to be able to include the data points on the X-Y plane. So I am wondering if someone has tried using the lm() function to do this?
library(ISLR)
data(Carseats)
lm.fit <- lm(Sales ~ Income + Advertising + Price, data = Carseats)
plot(Carseats$Income, Carseats$Sales, cex = 1.3, col = "red", pch = 19, main = "The relationship between Car Sales and Income", xlab = "Income", ylab = "Car Sales")
Is it possible to fit lm.fit on the bivariate plot along the dimension of Sales and Income, while holding other variables at their means?
You can compute this line without much trouble.
You already have the slope built into lm.fit.
The only question is what to use as the intercept.
We want to look at the plane where Advertising and Price
are held at their means. We want the point where the
regression line in this plane crosses the Sales-axis,
so the intercept will be the value where
Advertising = mean(Carseats$Advertising)
Price = mean(Carseats$Price)
and
Income = 0
so just use your model to compute that value.
## Assuming that you made your plot as in the question
IntPoint = data.frame(Income = 0,
                      Advertising = mean(Carseats$Advertising),
                      Price = mean(Carseats$Price))
Int2 = predict(lm.fit, IntPoint)          # intercept with the other predictors at their means
abline(Int2, coef(lm.fit)["Income"])      # slope of Income from the fitted model
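The same line can also be obtained by predicting over a grid of the variable of interest while fixing the other predictors at their means. The sketch below uses the built-in mtcars data as a stand-in (an assumption, chosen only so that it runs without the ISLR package); the variable names wt, hp and qsec play the roles of Income, Advertising and Price.

```r
# mtcars ships with base R and is used only as a stand-in dataset
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Grid over the predictor of interest; the other predictors stay at their means
grid <- data.frame(
  wt   = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  hp   = mean(mtcars$hp),
  qsec = mean(mtcars$qsec)
)
grid$mpg_hat <- predict(fit, newdata = grid)

# Scatter plot of the raw data with the conditional regression line on top
plot(mtcars$wt, mtcars$mpg, pch = 19, xlab = "wt", ylab = "mpg")
lines(grid$wt, grid$mpg_hat, lwd = 2)

# Cross-check against the intercept-and-slope construction
int_at_means <- predict(fit, data.frame(wt = 0, hp = mean(mtcars$hp),
                                        qsec = mean(mtcars$qsec)))
slope_wt <- coef(fit)["wt"]
```

For a purely linear model both constructions give the same line; the grid version has the advantage that it keeps working if the model later gains transformed or interaction terms.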

Why abline won't show line from glm with Gamma family?

I have the following data, which I'm trying to model with a GLM using the Gamma family. The model fits, except that abline won't show any line. What am I doing wrong?
y <- c(0.00904977380111,0.009174311972687,0.022573363475789,0.081632653008122,0.005571030584803,1e-04,0.02375296916921,0.004962779106823,0.013729977117333,0.00904977380111,0.004514672640982,0.016528925619835,1e-04,0.027855153258277,0.011834319585449,0.024999999936719,1e-04,0.026809651528869,0.016348773841071,1e-04,0.009345794439034,0.00457665899303,0.004705882305772,0.023201856194357,1e-04,0.033734939711656,0.014251781472007,0.004662004755245,0.009259259166667,0.056872037917387,0.018518518611111,0.014598540145986,0.009478673032951,0.023529411811211,0.004819277060357,0.018691588737881,0.018957345923721,0.005390835525461,0.056179775223141,0.016348773841071,0.01104972381185,0.010928961639344,1e-04,1e-04,0.010869565271444,0.011363636420778,0.016085790883856,0.016,0.005665722322786,0.01117318441372,0.028818443860841,1e-04,0.022988505862069,0.01010101,1e-04,0.018083182676638,0.00904977380111,0.00961538466323,0.005390835525461,0.005763688703004,1e-04,0.005571030584803,1e-04,0.014388489208633,0.005633802760722,0.005633802760722,1e-04,0.005361930241431,0.005698005811966,0.013986013986014,1e-04,1e-04)
x <- c(600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,744.47,744.47,744.47,744.47,744.47,744.47,744.47,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42)
hist(y,breaks=15)
plot(y~x)
fit <- glm(y ~ x, family = Gamma(link = "log"))
abline(fit)
abline plots linear functions, from a simple linear regression, say. A GLM with a Gamma family and a log link is nonlinear on the original scale. To visualize the fit of such a model, you could use predict (an example is given below). Several R packages (e.g. effects or visreg) provide functions that plot the fit directly on the original scale, including confidence intervals.
Here is an example with visreg, using your data and model:
library(visreg)
y <- c(0.00904977380111,0.009174311972687,0.022573363475789,0.081632653008122,0.005571030584803,1e-04,0.02375296916921,0.004962779106823,0.013729977117333,0.00904977380111,0.004514672640982,0.016528925619835,1e-04,0.027855153258277,0.011834319585449,0.024999999936719,1e-04,0.026809651528869,0.016348773841071,1e-04,0.009345794439034,0.00457665899303,0.004705882305772,0.023201856194357,1e-04,0.033734939711656,0.014251781472007,0.004662004755245,0.009259259166667,0.056872037917387,0.018518518611111,0.014598540145986,0.009478673032951,0.023529411811211,0.004819277060357,0.018691588737881,0.018957345923721,0.005390835525461,0.056179775223141,0.016348773841071,0.01104972381185,0.010928961639344,1e-04,1e-04,0.010869565271444,0.011363636420778,0.016085790883856,0.016,0.005665722322786,0.01117318441372,0.028818443860841,1e-04,0.022988505862069,0.01010101,1e-04,0.018083182676638,0.00904977380111,0.00961538466323,0.005390835525461,0.005763688703004,1e-04,0.005571030584803,1e-04,0.014388489208633,0.005633802760722,0.005633802760722,1e-04,0.005361930241431,0.005698005811966,0.013986013986014,1e-04,1e-04)
x <- c(600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,744.47,744.47,744.47,744.47,744.47,744.47,744.47,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42)
fit <- glm(y ~ x, family = Gamma(link = "log"))
visreg(fit, scale = "response")
And here is the example using base R graphics and predict:
pred_frame <- data.frame(
x = seq(min(x), max(x), length.out = 1000)
)
pred_frame$fit <- predict(fit, newdata = pred_frame, type = "response")
plot(y~x, pch = 16, las = 1, cex = 1.5)
lines(fit~x, data = pred_frame, col = "steelblue", lwd = 3)
You are being inconsistent here: you chose to model on the log scale, but you are plotting on the raw scale. (Mind you, many published plots do the same.) Either plot the points in log space, where the fitted relationship is a straight line that abline() can draw, or back-transform the fitted values and add them to the raw-scale plot as a curve.
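Both options can be sketched as follows. The data here are simulated stand-ins (the names x_sim and y_sim, the shape parameter and the true coefficients are all assumptions), so the block runs on its own.

```r
set.seed(42)
# Simulated stand-in data: Gamma response whose mean is exp(-3 - 0.0003 * x)
x_sim <- rep(c(600, 1500, 2500, 3500), each = 25)
mu    <- exp(-3 - 0.0003 * x_sim)
y_sim <- rgamma(length(x_sim), shape = 2, rate = 2 / mu)

fit_sim <- glm(y_sim ~ x_sim, family = Gamma(link = "log"))
b <- coef(fit_sim)

op <- par(mfrow = c(1, 2))
# Log scale: the linear predictor is a straight line, so abline() applies
plot(x_sim, log(y_sim), main = "log scale")
abline(a = b[1], b = b[2], col = "steelblue", lwd = 2)

# Raw scale: back-transform the linear predictor and draw it as a curve
plot(x_sim, y_sim, main = "raw scale")
curve(exp(b[1] + b[2] * x), add = TRUE, col = "steelblue", lwd = 2)
par(op)
```

Note that the line on the log-scale panel is the log of the fitted mean, not the mean of log(y), so it need not run through the middle of the logged points.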

Plotting a 95% confidence interval for a lm object

How can I calculate and plot a confidence interval for my regression in R? So far I have two numerical vectors of equal length (x, y) and a regression object (lm.out). I have made a scatterplot of y against x and added the regression line to this plot. I am looking for a way to add a 95% prediction confidence band for lm.out to the plot. I've tried using the predict function, but I don't even know where to start with that :/. Here is my code at the moment:
x=c(1,2,3,4,5,6,7,8,9,0)
y=c(13,28,43,35,96,84,101,110,108,13)
lm.out <- lm(y ~ x)
plot(x,y)
regression.data = summary(lm.out) #save regression summary as variable
names(regression.data) #get names so we can index this data
a= regression.data$coefficients["(Intercept)","Estimate"] #grab values
b= regression.data$coefficients["x","Estimate"]
abline(a,b) #add the regression line
Thank you!
Edit: I've taken a look at the proposed duplicate and can't quite get to the bottom of it.
You have to use predict with a new vector of data, here newx.
x=c(1,2,3,4,5,6,7,8,9,0)
y=c(13,28,43,35,96,84,101,110,108,13)
lm.out <- lm(y ~ x)
newx = seq(min(x),max(x),by = 0.05)
conf_interval <- predict(lm.out, newdata=data.frame(x=newx), interval="confidence",
level = 0.95)
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)
EDIT
As Ben mentions in the comments, this can also be done with matlines as follows:
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
matlines(newx, conf_interval[,2:3], col = "blue", lty=2)
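Since the question mentions a "prediction" band, it may be worth noting that predict() can return either kind of interval. The sketch below reuses the question's data and draws both; the object names new_df, conf_band and pred_band are introduced here for illustration only.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0)
y <- c(13, 28, 43, 35, 96, 84, 101, 110, 108, 13)
lm.out <- lm(y ~ x)
newx <- seq(min(x), max(x), by = 0.05)
new_df <- data.frame(x = newx)

# Confidence band: uncertainty about the fitted mean
conf_band <- predict(lm.out, newdata = new_df, interval = "confidence", level = 0.95)
# Prediction band: uncertainty about a new observation (always wider)
pred_band <- predict(lm.out, newdata = new_df, interval = "prediction", level = 0.95)

plot(x, y, ylim = range(pred_band), main = "Confidence and prediction bands")
abline(lm.out, col = "lightblue")
matlines(newx, conf_band[, c("lwr", "upr")], col = "blue", lty = 2)
matlines(newx, pred_band[, c("lwr", "upr")], col = "red", lty = 3)
```

Both matrices have the columns fit, lwr and upr, so selecting the limits by name avoids relying on column positions.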
I'm going to add a tip that would have saved me a lot of frustration when trying the method given by @Alejandro Andrade: if your data are in a data frame, then when you build your model with lm(), use the data= argument rather than $ notation. E.g., use
lm.out <- lm(y ~ x, data = mydata)
rather than
lm.out <- lm(mydata$y ~ mydata$x)
If you do the latter, then this statement
predict(lm.out, newdata=data.frame(x=newx), interval="confidence", level = 0.95)
seems to either ignore the new values passed using newdata= or there's a silent error. Either way, the output is the predictions from the original data, not the new data.
Also, be sure your x variable gets the same name in the new data frame that it had
in the original. That's easier to figure out because you do get an error, but knowing it ahead of time might save you a round of debugging.
Note: Tried to add this as a comment, but don't have enough reputation points.
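The difference between the two ways of building the model can be reproduced with a small sketch. It puts the question's vectors into a data frame called mydata (a name taken from the tip above); R usually emits a warning about mismatched row counts in the second case, which is suppressed here so the comparison runs quietly.

```r
mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0),
                     y = c(13, 28, 43, 35, 96, 84, 101, 110, 108, 13))
newx <- seq(min(mydata$x), max(mydata$x), by = 0.05)

# Model built with data=: the predictor is called "x", so newdata is matched
fit_good <- lm(y ~ x, data = mydata)
ci_good  <- predict(fit_good, newdata = data.frame(x = newx),
                    interval = "confidence", level = 0.95)

# Model built with $: the predictor is called "mydata$x", so the column "x"
# in newdata is never used and the original rows are predicted instead
fit_bad <- lm(mydata$y ~ mydata$x)
ci_bad  <- suppressWarnings(
  predict(fit_bad, newdata = data.frame(x = newx),
          interval = "confidence", level = 0.95)
)

nrow(ci_good)  # one row per value in newx
nrow(ci_bad)   # one row per row of mydata
```

Checking the number of rows returned by predict() against the length of the new data is a quick way to catch this problem.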

How to get only the plots from gam.check

When applying gam.check in the mgcv package, R produces some residual plots and basis dimension output. Is there a way to only produce the plots and not the printed output?
library(mgcv)
set.seed(0)
dat <- gamSim(1,n=200)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3), data=dat)
plot(b, pages=1)
gam.check(b, pch=19, cex=.3)
There are four plots. Starting from the top left and moving down, then across, we have:
A QQ plot of the residuals
A histogram of the residuals
A plot of residuals vs the linear predictor
A plot of observed vs fitted values.
In the code below, I assume b contains your fitted model, as per your example. First some things we need
type <- "deviance" ## "pearson" & "response" are other valid choices
resid <- residuals(b, type = type)
linpred <- napredict(b$na.action, b$linear.predictors)
observed.y <- napredict(b$na.action, b$y)
Note the last two lines are applying the NA handling method used when the model was fitted to the information on the linear.predictors and y, the stored copy of the response data.
The above code and that shown below is all given in the first 10 or so lines of the gam.check() source. To view this, just enter
gam.check
at the R prompt.
Each plot is produced as follows:
QQ plot
This is produced via qq.gam():
qq.gam(b, rep = 0, level = 0.9, type = type, rl.col = 2,
rep.col = "gray80")
Histogram of residuals
This is produced using
hist(resid, xlab = "Residuals", main = "Histogram of residuals")
Residuals vs linear predictor
This is produced using
plot(linpred, resid, main = "Resids vs. linear pred.",
xlab = "linear predictor", ylab = "residuals")
Observed vs fitted values
This is produced using
plot(fitted(b), observed.y, xlab = "Fitted Values",
ylab = "Response", main = "Response vs. Fitted Values")
There are now two packages, gratia and mgcViz, with functions that produce the gam.check output as ggplots, which you can store as an object. The former doesn't print anything to the console; the latter does.
require(gratia)
appraise(b)
require(mgcViz)
b = getViz(b)
check(b)
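Another option that needs nothing beyond base R and mgcv is to capture the printed text instead of letting it reach the console. This is a minimal sketch, assuming gam.check() writes its summary to standard output; the variable name printed is introduced here for illustration.

```r
library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 200, verbose = FALSE)
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

# capture.output() collects what gam.check() would print to the console;
# the four diagnostic plots are still drawn on the active graphics device
printed <- capture.output(gam.check(b, pch = 19, cex = 0.3))
```

The captured text stays available in printed, so the basis dimension checks can still be inspected later with cat(printed, sep = "\n").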

Getting vectors out of ggplot2

I am trying to show that there is a weird "bump" in some data I am analysing (it is to do with market share). My code is here:
qplot(Share, Rate, data = Dataset3, geom=c("point", "smooth"))
(I appreciate that this is not very useful code without the dataset).
Is there any way that I can get the numeric vector used to generate the smoothed line out of R? I just need that layer to try to fit a model to the smoothed data.
Any help gratefully received.
Yes, there is. For data sets with fewer than 1,000 observations, ggplot uses the function loess as the default smoother in geom_smooth. This means you can use loess directly to estimate your smoothing parameters.
Here is an example, adapted from ?loess:
library(ggplot2)
qplot(speed, dist, data=cars, geom="smooth")
Use loess to estimate the smoothed data, and predict for the estimated values:
cars.lo <- loess(dist ~ speed, cars)
pc <- predict(cars.lo, data.frame(speed = seq(4, 25, 1)), se = TRUE)
The estimates are now in pc$fit and the standard errors in pc$se.fit. The following bit of code extracts the fitted values into a data.frame and then plots it using ggplot:
pc_df <- data.frame(
x=4:25,
fit=pc$fit)
ggplot(pc_df, aes(x=x, y=fit)) + geom_line()
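If the uncertainty band is needed as well as the line, the standard errors returned by predict() can be turned into approximate limits. The following base R sketch builds a data frame with the smoothed values and a 95% band; the names grid, crit and smooth_df, the grid size of 80 points and the use of a t quantile on pc$df are choices made here for illustration.

```r
cars.lo <- loess(dist ~ speed, data = cars)

# Evaluate the smoother on an evenly spaced grid within the observed range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 80))
pc <- predict(cars.lo, newdata = grid, se = TRUE)

# Approximate 95% band from the standard errors (t quantile on pc$df)
crit <- qt(0.975, pc$df)
smooth_df <- data.frame(
  speed = grid$speed,
  fit   = pc$fit,
  lower = pc$fit - crit * pc$se.fit,
  upper = pc$fit + crit * pc$se.fit
)
```

smooth_df now holds plain numeric vectors that can be passed to any further modelling step.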
