How can I superimpose modified loess lines on a ggplot2 qplot?

Background
Right now, I'm creating a multiple-predictor linear model and generating diagnostic plots to assess regression assumptions. (It's for a multiple regression analysis stats class that I'm loving at the moment :-)
My textbook (Cohen, Cohen, West, and Aiken 2003) recommends plotting each predictor against the residuals to make sure that:
The residuals don't systematically covary with the predictor
The residuals are homoscedastic with respect to each predictor in the model
On point (2), my textbook has this to say:
Some statistical packages allow the analyst to plot lowess fit lines at the mean of the residuals (0-line), 1 standard deviation above the mean, and 1 standard deviation below the mean of the residuals....In the present case {their example}, the two lines {mean + 1sd and mean - 1sd} remain roughly parallel to the lowess {0} line, consistent with the interpretation that the variance of the residuals does not change as a function of X. (p. 131)
How can I modify loess lines?
I know how to generate a scatterplot with a "0-line":
# First, I'll make a simple linear model and get its diagnostic stats
library(ggplot2)
data(cars)
mod <- fortify(lm(speed ~ dist, data = cars))
attach(mod)
str(mod)
# Now I want to make sure the residuals are homoscedastic
qplot(x = dist, y = .resid, data = mod) +
  geom_smooth(se = FALSE)  # se = FALSE removes the standard error bands
But does anyone know how I can use ggplot2 and qplot to generate plots where the 0-line, "mean + 1sd" AND "mean - 1sd" lines would be superimposed? Is that a weird/complex question to be asking?

Apology
Folks, I want to apologize for my ignorance. Hadley is absolutely right, and the answer was right in front of me all along. As I suspected, my question was born of statistical, rather than programmatic ignorance.
We get the 68% Confidence Interval for Free
geom_smooth() defaults to loess smoothing, and its standard error band superimposes the +1sd and -1sd lines as part of the deal; the band defaults to 95%, so you request level = 0.68 to get roughly +/- 1 standard error. That's what Hadley meant when he said "Isn't that just a 68% confidence interval?" I just completely forgot that's what the 68% interval is, and kept searching for something that I already knew how to do. It didn't help that I'd actually turned the confidence intervals off in my code by specifying geom_smooth(se = FALSE).
What my Sample Code Should Have Looked Like
# First, I'll make a simple linear model and get its diagnostic stats.
library(ggplot2)
data(cars)
mod <- fortify(lm(speed ~ dist, data = cars))
attach(mod)
str(mod)
# Now I want to make sure the residuals are homoscedastic.
# geom_smooth defaults to loess; level = 0.68 requests the 68% band,
# i.e. roughly +/- 1 standard error around the smooth (the default is 95%).
qplot(x = dist, y = .resid, data = mod) +
  geom_abline(slope = 0, intercept = 0) +
  geom_smooth(level = 0.68)
What I've Learned
Hadley implemented a really beautiful and simple way to get what I'd wanted all along. But because I was focused on loess lines, I lost sight of the fact that the 68% confidence interval was bounded by the very lines I needed. Sorry for the trouble, everyone.

Could you calculate the +/- standard deviation values from the data and add a fitted curve of them to the plot?
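One way to implement that suggestion (a sketch, not code from this thread): fit one loess to the residuals for the center line, fit a second loess to the absolute residuals as a rough local-spread estimate, and draw the center line plus and minus that spread.
# Sketch only: abs(.resid) estimates local spread (a mean absolute
# deviation rather than a true sd), which is enough for an eyeball check.
library(ggplot2)
mod <- fortify(lm(speed ~ dist, data = cars))
center <- loess(.resid ~ dist, data = mod)        # the loess "0-line"
spread <- loess(abs(.resid) ~ dist, data = mod)   # local spread of the residuals
grid <- data.frame(dist = seq(min(mod$dist), max(mod$dist), length.out = 100))
grid$mid <- predict(center, newdata = grid)
grid$sd <- predict(spread, newdata = grid)
ggplot(mod, aes(dist, .resid)) +
  geom_point() +
  geom_line(data = grid, aes(dist, mid)) +
  geom_line(data = grid, aes(dist, mid + sd), linetype = "dashed") +
  geom_line(data = grid, aes(dist, mid - sd), linetype = "dashed")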

Have a look at my question "modify lm or loess function.."
I am not sure I followed your question very well, but maybe a:
+ stat_smooth(method = yourfunction)
will work, provided that you define your function as described here.
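For instance (my own illustration, not the linked answer), any fitting function with an lm-style interface can be passed as the method, e.g. a robust regression:
library(ggplot2)
library(MASS)  # for rlm()
mod <- fortify(lm(speed ~ dist, data = cars))
# stat_smooth accepts a modelling function as its method
ggplot(mod, aes(dist, .resid)) +
  geom_point() +
  stat_smooth(method = MASS::rlm, formula = y ~ x, se = FALSE)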

Related

How to convert the y-axis of a plot from log(y) to y

I'm an R newbie. I want to estimate a regression of log(CONSUMPTION) on INCOME and then make a plot of CONSUMPTION and INCOME.
I can run the following regression and plot the results.
library(jtools)  # effect_plot() comes from the jtools package
results <- lm(I(log(CONSUMPTION)) ~ INCOME, data = dataset)
effect_plot(results, pred = INCOME)
If I do this, I get log(CONSUMPTION) on the vertical axis rather than CONSUMPTION.
How can I get a plot with CONSUMPTION on the vertical axis?
Another way to ask the question is how do I convert the y-axis of a plot from log(y) to y? While my question is for the function effect_plot(), I would be happy with any plot function.
Thanks for any help you can give me.
Thank you for the responses. I was able to figure out a workaround using Poisson regression:
results1 <- glm(CONSUMPTION ~ INCOME + WEALTH, family = poisson, data = Consumption)
effect_plot(results1, pred = INCOME, data = Consumption)
This allows me to identify the effect of one variable (INCOME) even when the regression has more than one explanatory variable (INCOME+WEALTH), and plots the estimated effect with CONSUMPTION on the vertical axis rather than ln(CONSUMPTION), with INCOME on the horizontal axis.
The associated estimates are virtually identical to what I would get from the log-linear regression:
results2 <- lm(I(log(CONSUMPTION)) ~ INCOME + WEALTH, data = Consumption)
I appreciate you taking the time to help me with my problem.
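A related approach that keeps the original log-linear model (a sketch, assuming the question's dataset and column names): predict on the log scale and exponentiate before plotting, so CONSUMPTION rather than log(CONSUMPTION) ends up on the vertical axis.
library(ggplot2)
# Hypothetical: `dataset` with CONSUMPTION and INCOME columns, as in the question.
results <- lm(log(CONSUMPTION) ~ INCOME, data = dataset)
grid <- data.frame(INCOME = seq(min(dataset$INCOME), max(dataset$INCOME),
                                length.out = 100))
# Note: exp() of the fitted log-mean estimates the median of CONSUMPTION,
# not its mean -- fine for a trend line, but worth knowing.
grid$CONSUMPTION <- exp(predict(results, newdata = grid))
ggplot(dataset, aes(INCOME, CONSUMPTION)) +
  geom_point() +
  geom_line(data = grid)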

How to calculate risk difference from a regression model in R?

I know how to calculate risk difference with a 2x2 table, but I have no idea how to do this with a regression model, even though it is a quite widely used method when you need to adjust variables in question.
In case I'm not making any sense, here's an article that discusses proper ways to calculate risk difference, but unfortunately it doesn't contain any code: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-016-0217-0
If I have understood your question, I think the book Regression Modeling Strategies may be what you are looking for. For example:
# Load the rms package by FE Harrell et al. (install the package first if needed)
library(rms)
# Create a fit object using the aml example data
# (from the survival package, which is loaded along with rms)
fit <- npsurv(Surv(time, status) ~ x, data = aml)
# Then you can plot a Kaplan-Meier survival curve.
plot(fit)
# Then plot the 'Risk difference' for your data, with 95% confidence limits
survdiffplot(fit, xlim = c(0,60))
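A regression-based alternative (my own sketch, not from the rms answer, with hypothetical variable names): with a logistic model you can get an adjusted risk difference by model-based standardization, predicting everyone's risk under exposure and under no exposure and averaging the difference.
# Hypothetical data frame `d` with a binary `outcome`, a 0/1 `exposure`,
# and covariates to adjust for.
fit <- glm(outcome ~ exposure + age + sex, family = binomial, data = d)
# Predicted risk for every subject, first as exposed, then as unexposed
p1 <- predict(fit, newdata = transform(d, exposure = 1), type = "response")
p0 <- predict(fit, newdata = transform(d, exposure = 0), type = "response")
mean(p1) - mean(p0)  # adjusted (standardized) risk difference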

Clustering standard errors by a variable in a logistic regression - for graphing interaction plot (R)

I'm running a logistic regression/survival analysis where I cluster standard errors by a variable in the dataset. I'm using R.
Since this is not as straightforward as it is in Stata, I'm using a solution I found in the past: https://www.rdocumentation.org/packages/miceadds/versions/3.0-16/topics/lm.cluster
As an illustrative example of what I'm talking about:
model <- miceadds::glm.cluster(data = data, formula = outcome ~ a + b + c + years + I(years^2) + I(years^3), cluster = "cluster.id", family = "binomial")
This works well for getting the important values: it produces the coefficients, (clustered) std. errors, and z-values. It took me forever just to get to this solution, and even now it is not ideal (for example, the result cannot be passed to stargazer). I've explored a lot of the other common suggestions on this issue - such as the Economic Theory solution (https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r/); however, that approach is for lm() and I cannot get it to work for logistic regression.
I'm not beyond just running two models, one with glm() and one with glm.cluster() and replacing the standard errors in stargazer manually.
My concern is that I am at a loss as to how I would graph the above function, say if I were to do the following instead:
model <- miceadds::glm.cluster(data = data, formula = outcome ~ a*b + c + years + I(years^2) + I(years^3), cluster = "cluster.id", family = "binomial")
In this case, I want to graph a predicted probability plot to look at the interaction between a*b on my outcome; however, I cannot do so with the glm.cluster() object. I have to do it with a glm() model, but then my confidence intervals are awash.
I've been looking into a lot of the options on clustering standard errors for logistic regression around here, but am at a complete loss.
Has anyone found any recent developments on how to do so in R?
Are there any packages that let you cluster SE by a variable in the dataset and plot the objects? (Bonus points for interactions)
Any and all insight would be appreciated. Thanks!
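One route worth trying (a sketch of my own, not a tested answer from this thread): fit a plain glm() and compute the cluster-robust covariance afterwards with the sandwich and lmtest packages.
library(sandwich)
library(lmtest)
# Hypothetical names matching the question: `data` with outcome, a, b, c,
# years, and cluster.id columns.
m <- glm(outcome ~ a * b + c + years + I(years^2) + I(years^3),
         family = binomial, data = data)
# Cluster-robust covariance and the corresponding coefficient table
vc <- vcovCL(m, cluster = ~ cluster.id)
coeftest(m, vcov = vc)
Interaction plots could then be built by hand from predict() plus vc, or via packages that accept a user-supplied vcov; I haven't verified the latter for this exact setup.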

Interpolation and Curve fitting with R

I am a chemical engineer and very new to R. I am attempting to build a tool in R (and eventually a shiny app) for analysis of phase boundaries. Using a simulation I get output that shows two curves which can be well represented by a 4th order polynomial. The data is as follows:
https://i.stack.imgur.com/8Oa0C.jpg
The procedure I have to follow uses the difference between the two curves to produce a new one. In order to compare the curves, the data has to increase as a function of pressure in set increments, for example 0.2. As can be seen, the data from the simulation is not incremental, and there is no way to compare the curves based on the raw output.
To resolve this, in excel I carried out the following steps on each curve:
I plotted the data with pressure on the x axis and temperature on the y axis
Found the line of best fit using a 4th order polynomial
Used the equation of the curve to calculate the temperature at set increments of pressure
From this, I was able to compare the curves mathematically and produce the required output.
Does anyone have any suggestions on how to carry this out in R, or is there a more statistical or simplified approach that I have missed (extracting Bezier curve points, etc.)?
As a bit of further detail, I have taken the data and merged it using tidyr so that the graphs (4 in total) are displayed in just three columns: the graph title, temperature, and pressure. I did this after following a course on ggplot2 on DataCamp, but I'm not sure if this format is suitable for carrying out regression, etc. The head of my dataset can be seen here:
https://i.stack.imgur.com/WeaPz.jpg
I am very new to R, so apologies if this is a stupid question and I am using the wrong terms.
Though I agree with @Jaap's comment, polynomial regression is very easy in R. I'll give you the first line:
x <- c(0.26,3.33,5.25,6.54,7.38,8.1,8.73,9.3,9.81,10.28,10.69,11.08,11.43,11.75,12.05,12.33)
y <- c(16.33,24.6,31.98,38.38,43.3,48.18,53.08,57.99,62.92,67.86,72.81,77.77,82.75,87.75,92.77,97.81)
fit <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))  # call it fit, not lm, to avoid masking lm()
Now your polynomial coefficients are in coef(fit); you can extract them and easily plot the fitted line, e.g.:
coefs <- coef(fit)
plot(x, y)
lines(x, coefs[1] + coefs[2] * x + coefs[3] * x^2 + coefs[4] * x^3 + coefs[5] * x^4)
The fitted values are also simply given by fitted(fit). Build the same polynomial for the second curve and compare the coefficients, not just the "lines".
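To get back to the original goal of comparable curves, a sketch building on the fit above (fit2 for the second curve is hypothetical):
# Evaluate the fitted polynomial on a common pressure grid in 0.2 steps,
# as described in the question, so the curves can be subtracted.
pressure.grid <- seq(0.2, 12.2, by = 0.2)
temp1 <- predict(fit, newdata = data.frame(x = pressure.grid))
# Repeat for the second curve (fit2, not shown), then e.g.:
# temp.diff <- temp1 - predict(fit2, newdata = data.frame(x = pressure.grid))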

add a logarithmic regression line to a scatterplot (comparison with Excel)

In Excel, it's pretty easy to fit a logarithmic trend line to a given set of data: just click Add Trendline and then select "Logarithmic." Switching to R for more power, I am a bit lost as to which function one should use to generate this.
To generate the graph, I used ggplot2 with the following code.
ggplot(data, aes(horizon, success)) +
  geom_line() +
  geom_area(alpha = 0.3) +
  stat_smooth(method = 'loess')
But that code does local polynomial regression fitting, which is based on averaging out numerous small linear regressions. My question is whether there is a log trend line in R similar to the one used in Excel.
An alternative I am looking for is a log equation of the form y = c*ln(x) + b; is there a coef() function to get c and b?
Let my data be:
c(0.599885189,0.588404133,0.577784156,0.567164179,0.556257176,
0.545350172,0.535112897,0.52449292,0.51540375,0.507271336,0.499904325,
0.498851894,0.498851894,0.497321087,0.4964600,0.495885955,0.494068121,
0.492154612,0.490145427,0.486892461,0.482395714,0.477229238,0.471010333)
The above data are the y-points, while the x-points are simply the integers 1:length(y). In Excel, I can simply plot this and add a logarithmic trend line, with the black line being the log fit. In R, how would one do this with the above dataset?
I prefer to use base graphics instead of ggplot2:
#some data with a linear model
x <- 1:20
set.seed(1)
y <- 3*log(x)+5+rnorm(20)
#plot data
plot(y~x)
#fit log model
fit <- lm(y~log(x))
#look at result and statistics
summary(fit)
#extract coefficients only
coef(fit)
#plot fit with confidence band
matlines(x = seq(from = 1, to = 20, length.out = 1000),
         y = predict(fit, newdata = list(x = seq(from = 1, to = 20, length.out = 1000)),
                     interval = "confidence"))
#some data with a non-linear model
set.seed(1)
y <- log(0.1*x)+rnorm(20,sd=0.1)
#plot data
plot(y~x)
#fit log model
fit <- nls(y~log(a*x),start=list(a=0.2))
#look at result and statistics
summary(fit)
#plot fit
lines(seq(from = 1, to = 20, length.out = 1000),
      predict(fit, newdata = list(x = seq(from = 1, to = 20, length.out = 1000))))
You can easily specify alternative smoothing methods (such as lm(), linear least-squares fitting) and an alternative formula:
library(ggplot2)
g0 <- ggplot(dat, aes(horizon, success)) + geom_line() + geom_area(alpha=0.3)
g0 + stat_smooth(method = "lm", formula = y ~ log(x), fill = "red")
The confidence bands are automatically included: I changed the color to make them visible since they're very narrow. You can use se=FALSE in stat_smooth to turn them off.
The other answer shows you how to get the coefficients:
coef(lm(success ~ log(horizon), data = dat))
I can imagine you might next want to add the equation to the graph: see Adding Regression Line Equation and R2 on graph
I'm pretty sure a simple + scale_y_log10() would get you what you wanted. ggplot2 stats are calculated after scale transformations, so the loess() would then be fitted to the log-transformed data.
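In code, that suggestion would look something like this (using the question's dat, horizon, and success names, which are assumptions here):
library(ggplot2)
# The y axis is log-scaled, and the loess smooth is fitted to the
# transformed values, per the answer above.
ggplot(dat, aes(horizon, success)) +
  geom_line() +
  geom_area(alpha = 0.3) +
  stat_smooth(method = "loess", formula = y ~ x) +
  scale_y_log10()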
I've just written a blog post here that describes how to match Excel's logarithmic curve fitting exactly. The nub of the approach centers around the lm() function:
# Set x and data.to.fit to the independent and dependent variables
data.to.fit <- c(0.5998,0.5884,0.5777,0.5671,0.5562,0.5453,0.5351,0.524,0.515,0.5072,0.4999,0.4988,0.4988,0.4973,0.49,0.4958,0.4940,0.4921,0.4901,0.4868,0.4823,0.4772,0.4710)
x <- seq_along(data.to.fit)
data.set <- data.frame(x, data.to.fit)
# Perform a logarithmic fit to the data set
log.fit <- lm(data.to.fit ~ log(x), data = data.set)
# Print out the intercept, log(x) parameter, R-squared value, etc.
summary(log.fit)
# Plot the original data set
plot(data.set)
# Add the log.fit line with confidence intervals
# (plotting against the row index works here because x is simply 1..n;
# in general, pass x explicitly as the first argument to matlines)
matlines(predict(log.fit, data.frame(x = x), interval = "confidence"))
Hope that helps.
