Plot the observed and fitted values from a linear regression using xyplot() from the lattice package - r

I can create simple graphs. I would like to have the observed and predicted values (from a linear regression) on the same graph. I am plotting, say, Yvariable vs Xvariable, with only one predictor and one response. How could I also add the linear regression line to the same graph?
So to conclude, I need help with:
plotting the actual and predicted values together
plotting the regression line

Here is one option for showing the observed and predicted values in a single plot as points. It is easier to get the regression line onto the observed points, which I illustrate second.
First, some dummy data:
set.seed(1)
x <- runif(50)
y <- 2.5 + (3 * x) + rnorm(50, mean = 2.5, sd = 2)
dat <- data.frame(x = x, y = y)
Fit our model
mod <- lm(y ~ x, data = dat)
Combine the model output and the observed x into a single object for plotting. stack() returns the stacked data in a values column plus a factor ind identifying the source column, which is just what xyplot()'s groups argument needs:
res <- stack(data.frame(Observed = dat$y, Predicted = fitted(mod)))
res <- cbind(res, x = rep(dat$x, 2))
head(res)
Load lattice and plot
require("lattice")
xyplot(values ~ x, data = res, group = ind, auto.key = TRUE)
The resulting plot should look similar to this
To get just the regression line on the observed data, when the model is a simple straight-line fit like the one shown here, you can circumvent most of this and plot directly using
xyplot(y ~ x, data = dat, type = c("p","r"), col.line = "red")
(i.e. you don't even need to fit the model or make new data for plotting)
The resulting plot should look like this
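As an aside (my addition, not part of the original answer), the same type mechanism will draw a loess smoother instead of the least-squares line, should the relationship not be straight:
xyplot(y ~ x, data = dat, type = c("p", "smooth"), col.line = "red")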
An alternative to the first example, which works with anything that can give you coefficients for the regression line, is to write your own panel function - not as scary as it seems:
xyplot(y ~ x, data = dat, col.line = "red",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(coef = coef(mod), ...) ## using mod from earlier
       }
)
That gives the same plot as Figure 2 above, but drawn by hand.
Assuming you've fitted the model with caret instead, then
library("caret")
mod <- train(y ~ x, data = dat, method = "lm",
             trControl = trainControl(method = "cv"))
xyplot(y ~ x, data = dat, col.line = "red",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(coef = coef(mod$finalModel), ...) ## using mod from caret
       }
)
will produce the same plot as Figure 2 above.
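As a quick sanity check (a sketch of my own, assuming the caret model above), you can also plot observed against predicted values straight from the train object:
## predict() on a train object returns predictions from the final model
pred_caret <- predict(mod, newdata = dat)
xyplot(dat$y ~ pred_caret, xlab = "Predicted", ylab = "Observed",
       panel = function(...) {
         panel.xyplot(...)
         panel.abline(0, 1) ## 1:1 reference line
       })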

Another option is to use panel.lmlineq from latticeExtra.
library(latticeExtra)
set.seed(0)
xsim <- rnorm(50, mean = 3)
ysim <- (0 + 2 * xsim) * (1 + rnorm(50, sd = 0.3))
## basic use as a panel function
xyplot(ysim ~ xsim, panel = function(x, y, ...) {
  panel.xyplot(x, y, ...)
  panel.lmlineq(x, y, adj = c(1, 0), lty = 1, col.text = "red",
                col.line = "blue", digits = 1, r.squared = TRUE)
})

Related

Unable to plot confidence intervals using ggplot, (geom_ribbon() argument)

I am trying to plot 95% confidence intervals on some simulated values but am running into some issues when I try to plot the CIs using geom_ribbon(). The trouble I'm having is that the CIs do not show up when I plot the model, like so:
I have included all of my code below if anyone knows where I have gone wrong:
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + .1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
model2 <- glm(y ~ x,
              data = df,
              family = poisson(link = 'log'))
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#transform the CI limit to get one at the level of the mean
upper_mod2 = exp(upper_mod2)/(1+exp(upper_mod2))
lower_mod2 = exp(lower_mod2)/(1+exp(lower_mod2))
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')
A comment on the question asks why not logit-transform the predicted values. The reason is that the type of prediction requested is "response". From the documentation, my emphasis:
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
There is a good way to answer: show the code.
library(ggplot2, quietly = TRUE)
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + 0.1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
suppressWarnings(
  model2 <- glm(y ~ x,
                data = df,
                family = poisson(link = 'log'))
)
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')
Created on 2022-05-29 by the reprex package (v2.0.1)
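As an aside (a variant not shown in the answer above), you can also build the interval on the link scale and back-transform with exp(), which guarantees positive limits for a Poisson mean. A minimal sketch using model2 and df from above:
## CI computed on the log (link) scale, then back-transformed with exp()
pred_link <- predict(model2, df, se.fit = TRUE, type = "link")
predframe_link <- data.frame(x = df$x, y = df$y,
                             fit = exp(pred_link$fit),
                             lwr = exp(pred_link$fit - 1.96 * pred_link$se.fit),
                             upr = exp(pred_link$fit + 1.96 * pred_link$se.fit))
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe_link, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(data = predframe_link, aes(x, fit), col = 'blue')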

Use lattice xyplot for grouped data with points and lines

I wish to make a single-panel graph in lattice that shows data (y) from several groups (g) with superimposed lines showing predicted values (y_pred). I generate example data below:
d <- data.frame(x = rep(1:100, 2), g = factor(rep(c('a','b'), each = 100)))
d$y_pred <- with(d, -0.1 * x + 0.001 * x^2) ## x must come from d
d$y_pred <- with(d, ifelse(g == 'a', y_pred + 2, y_pred))
d$y <- d$y_pred + rnorm(nrow(d), 0, 1)
Using type = c('p','l'), distribute.type = TRUE does not work, nor does my attempt at writing a panel function:
xyplot(y + y_pred ~ x, data = d,
       groups = g,
       panel = panel.superpose,
       panel.groups = function(...) {
         panel.xyplot(x, y, type = 'p')
         panel.xyplot(x, y_pred, type = 'l')
       }
)
What should I do here?
OK, you can do this with latticeExtra:
library("latticeExtra")
dat <- xyplot(y ~ x, data = d,
              groups = g,
              type = "p")
dat
dat + layer(panel.xyplot(x = x, y = y_pred,
                         groups = g,
                         type = "l",
                         subscripts = TRUE),
            data = d)
But this is really finicky and non-robust. The dat + layer() code renders properly if it is the first thing I run after starting R, but after that it will frequently render with several of the lines for different groups missing.
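For completeness, here is a sketch (my repair, not from the original answer) of how the question's panel.groups attempt can be made to work: declare subscripts in the function's arguments and use it to index each group's predicted values in d.
xyplot(y ~ x, data = d, groups = g,
       panel = panel.superpose,
       panel.groups = function(x, y, subscripts, type, ...) {
         panel.xyplot(x, y, type = "p", ...)                     ## this group's observed points
         panel.xyplot(x, d$y_pred[subscripts], type = "l", ...)  ## its predicted line
       })
Capturing type explicitly avoids it being passed twice, and because subscripts is a named argument, panel.superpose supplies the row indices of the original data for each group.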

brms package in R smoother

I have this data frame in R:
x = rep(seq(-10, 10, 1), each = 5)
y = rep(0, length(x))
weights = sample(seq(1, 20, 1), length(x), replace = TRUE)
weights = weights / sum(weights)
groups = rep(letters[1:5], times = length(x) / 5)
dat = data.frame(x, y, weights, group = groups) ## assemble the data frame used below
and some data that looks like this:
library(ggplot2)
ggplot(data = dat, aes(x = x, y = y, color = group)) +
  geom_point(aes(size = weights)) +
  ylab("outcome") +
  xlab("predictor x1") +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
library(brms)
fit_brms = brm(y ~ s(x) + (1|group), data = dat)
by_group = marginal_effects(fit_brms,
                            conditions = data.frame(group = dat$group),
                            re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)
I'd like to make a hierarchical nonlinear model so that there is a different nonlinear fit for each group.
The brms code above fits a spline on the x predictor with random intercepts on group, but the fitted line is the same for all groups; the only difference is where the lines cross the y-axis. Is there a way to make the nonlinear fit different for each group's data points?
On page 13 here: https://cran.r-project.org/web/packages/brms/vignettes/brms_multilevel.pdf
it states: "As the smooth term itself cannot be modeled as varying by year in a multilevel manner, we add a basic varying intercept in an effort to account for variation between years."
So it appears the spline will be the same for all groups? The only difference in the plots is where the spline crosses the y-axis. That seems very restrictive. Can this be modified to make the spline unique to each group?
Use the formula: y ~ s(x, by = group) + (1|group)
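A minimal sketch of the full call under that change (assumptions: brms loaded and the dat built above; by = group needs a factor):
dat$group <- factor(dat$group) ## 'by' requires a factor
fit_by_group <- brm(y ~ s(x, by = group) + (1|group), data = dat)
## marginal_effects() was later renamed conditional_effects() in brms
by_group <- marginal_effects(fit_by_group,
                             conditions = data.frame(group = unique(dat$group)),
                             re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)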

Using lines() with 'multiple x entries'

I'm looking for a way to plot a nonlinear regression line on a data set where every value in my vector y is being stored multiple times, so I tried to use something like:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,4,9,15,25,9,36,25,36,25)
reg4 <- lm( x ~ y + I(y^2) )
plot(x ~ y)
lines(y, predict(reg4), type="l", col="red", lwd=1)
this gives http://i.imgur.com/qSEVNdT.png
So my question is: is there a way to, say, use some sort of mean value for each y entry, or just make it a continuous line instead of something that branches off into multiple lines and returns to a lower y value at the points where there are multiple entries?
In these cases, it is best to predict from the model over the range of the covariate (here that is y, since the model is x ~ y + I(y^2)). You do this for, say, 50 or 100 locations equally spaced over that range, increasing or decreasing the number of locations as needed - more complex responses will need more locations. Doing this also solves the spaghetti-plot issue, because the newdata supplied is already in sorted order:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,4,9,15,25,9,36,25,36,25)
reg4 <- lm( x ~ y + I(y^2) )
## predictions
pred <- data.frame(y = seq(min(y), max(y), length = 100))
pred <- transform(pred, x = predict(reg4, newdata = pred))
## plot
plot(x ~ y)
lines(x ~ y, data = pred, type = "l", col = "red", lwd = 1)
The problem does not come from the ties in the data: for a given value of y, there is only one forecast. The problem is that the points are not sorted, so when you join them you end up with a tangle of lines. You can use order() to reorder the points:
plot(x ~ y,
     xlab = "y", ylab = "x") # confusing, since the model is x ~ y
i <- order(y)
lines(y[i], predict(reg4)[i])
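And if you literally want one averaged fitted value per distinct y, as the question muses, a sketch with aggregate() (my addition; here the fit depends only on y, so averaging simply removes the duplicates):
agg <- aggregate(predict(reg4), by = list(y = y), FUN = mean)
lines(agg$y, agg$x, col = "blue") ## aggregate names the averaged column 'x'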

Linear regression in R (normal and logarithmic data)

I want to carry out a linear regression in R for data in a normal and in a double logarithmic plot.
For normal data the dataset might be the following:
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
plot(lin$x, lin$y)
There I want to calculate and draw a line for the linear regression of only the data points 2, 3 and 4.
For double logarithmic data the dataset might be the following:
data = data.frame(
  x = c(1:15),
  y = c(1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
        0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020)
)
plot(data$x, data$y, log = "xy")
Here I want to draw the regression lines for the subsets 1:7 and 8:15.
How can I calculate the slope and the y-offset, as well as parameters for the fit (R^2, p-value)?
How is it done for normal and for logarithmic data?
Thanks for your help,
Sven
In R, linear least squares models are fitted via the lm() function. Using the formula interface we can use the subset argument to select the data points used to fit the actual model, for example:
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)
giving:
R> linm
Call:
lm(formula = y ~ x, data = lin, subset = 2:4)
Coefficients:
(Intercept) x
-1.633 1.500
R> fitted(linm)
2 3 4
-0.1333333 1.3666667 2.8666667
As for the double log, you have two choices I guess; i) estimate two separate models as we did above, or ii) estimate via ANCOVA. The log transformation is done in the formula using log().
Via two separate models (I copy the question's data to dat first, to avoid confusion with the data() function):
dat <- data
logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
logm2 <- lm(log(y) ~ log(x), data = dat, subset = 8:15)
Or via ANCOVA, where we need an indicator variable
dat <- transform(dat, ind = factor(1:15 <= 7))
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)
You might ask whether these two approaches are equivalent. Well, they are, and we can show this via the model coefficients.
R> coef(logm1)
(Intercept) log(x)
-0.0001487042 -0.4305802355
R> coef(logm2)
(Intercept) log(x)
0.1428293 -1.4966954
So the two slopes are -0.4306 and -1.4967 for the separate models. The coefficients for the ANCOVA model are:
R> coef(logm3)
(Intercept) log(x) indTRUE log(x):indTRUE
0.1428293 -1.4966954 -0.1429780 1.0661152
How do we reconcile the two? Well, the way I set up ind, logm3 is parametrised to give most directly the values estimated by logm2; the intercepts of logm2 and logm3 are the same, as are the coefficients for log(x). To get the values equivalent to the coefficients of logm1, we need a little manipulation. First store the coefficients, then compute the intercept:
coefs <- coef(logm3)
R> coefs[1] + coefs[3]
(Intercept)
-0.0001487042
where the coefficient for indTRUE is the difference in the mean of group 1 over the mean of group 2. And for the slope:
R> coefs[2] + coefs[4]
log(x)
-0.4305802
which is the same as we got for logm1 and is based on the slope for group 2 (coefs[2]) modified by the difference in slope for group 1 (coefs[4]).
As for plotting, an easy way is via abline() for simple models. E.g. for the normal data example:
plot(y ~ x, data = lin)
abline(linm)
For the log data we might need to be a bit more creative, and the general solution here is to predict over the range of data and plot the predictions:
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x, 1),
                                     by = 0.1)))
pdat <- transform(pdat, yhat = c(predict(logm1, pdat[1:70, , drop = FALSE]),
                                 predict(logm2, pdat[71:141, , drop = FALSE])))
which we can plot on the original scale by exponentiating yhat:
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:141, col = "blue")
or on the log scale:
plot(log(y) ~ log(x), data = dat)
lines(yhat ~ log(x), data = pdat, subset = 1:70, col = "red")
lines(yhat ~ log(x), data = pdat, subset = 71:141, col = "blue")
This general solution works well for the more complex ANCOVA model too. Here I create a new pdat as before and add in an indicator:
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x, 1),
                                     by = 0.1)[1:140],
                             ind = factor(rep(c(TRUE, FALSE), each = 70))))
pdat <- transform(pdat, yhat = predict(logm3, pdat))
Notice how we get all the predictions we want from the single call to predict() because of the use of ANCOVA to fit logm3. We can now plot as before:
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:140, col = "blue") ## this pdat has 140 rows
#Split the data into two groups
data1 <- data[1:7, ]
data2 <- data[8:15, ]
#Perform the regression
model1 <- lm(log(y) ~ log(x), data1)
model2 <- lm(log(y) ~ log(x), data2)
summary(model1)
summary(model2)
#Plot it
with(data, plot(x, y, log="xy"))
lines(1:7, exp(predict(model1, data.frame(x = 1:7))))
lines(8:15, exp(predict(model2, data.frame(x = 8:15))))
In general, splitting the data into different groups and running different models on different subsets is unusual, and probably bad form. You may want to consider adding a grouping variable
data$group <- factor(rep(letters[1:2], times = 7:8))
and running some sort of model on the whole dataset, e.g.,
model_all <- lm(log(y) ~ log(x) * group, data)
summary(model_all)
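A natural follow-up (my addition, not part of the original answer) is to check whether separate slopes are actually supported, by comparing the interaction model against a common-slope model:
## nested model comparison; a small p-value favours distinct slopes per group
model_common <- lm(log(y) ~ log(x) + group, data)
anova(model_common, model_all)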
