Making surface plot of regression estimates from multiple continuous variables - r

I have a multi-level model with categorical and continuous variables and splines. Nice and complex. Anyhow, I am trying to visualize the model fit.
For example, here is some toy data:
library(lme4)
library(rms)
library(ggplot2)
library(gridExtra)
library(lattice)  # for wireframe() below
## Make model using sleepstudy data
head(sleepstudy)
# Add some extra vars
sleepstudy$group <- factor( sample(c(1,2), nrow(sleepstudy), replace=TRUE) )
sleepstudy$x1 <- jitter(sleepstudy$Days, factor=5)^2 * jitter(sleepstudy$Reaction)
# Set up a mixed model with spline
fm1 <- lmer(Reaction ~ rcs(Days, 4) * group + (rcs(Days, 4) | Subject), sleepstudy)
# Now add continuous covar
fm2 <- lmer(Reaction ~ rcs(Days, 4) * group + x1 + (rcs(Days, 4) | Subject), sleepstudy)
# Plot fit
new.df <- sleepstudy
new.df$pred1 <- predict(fm1, new.df, allow.new.levels=TRUE, re.form=NA)
new.df$pred2 <- predict(fm2, new.df, allow.new.levels=TRUE, re.form=NA)
g1 <- ggplot(data=new.df, aes(x=Days)) +
  geom_line(aes(y=pred1, col=group), size=2) +
  ggtitle("Model 1")
g2 <- ggplot(data=new.df, aes(x=Days)) +
  geom_line(aes(y=pred2, col=group), size=2) +
  ggtitle("Model 2")
grid.arrange(g1, g2, nrow=1)
Plot 1 is smooth, but plot 2 is jagged due to the effect of x1. So I would like to make a surface plot with x = Days, y = x1, and z = pred2, stratified by group. Not having experience with surface plots, I've started out with lattice's wireframe command:
wireframe(pred2 ~ Days * x1, data = new.df[new.df$group==1,],
          xlab = "Days", ylab = "x1", zlab = "Predicted fit")
However, although this command does not give an error, my plot is blank:
Questions:
Where am I going wrong with my wireframe?
Is there a better way to visualize my model fit?

I figured out that the data format needed for a wireframe or plot_ly surface is a 2D matrix of x rows by y columns of corresponding z values (I got a hint towards this from this question: Plotly 3d surface graph has incorrect x and y axis values). I also realised I could use expand.grid to make a grid covering the range of possible x and y values and use those to predict z as follows:
days <- 0:9
x1_range <- range(sleepstudy$x1)[2] * c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3)
new.data2 <- expand.grid(Days = days, x1 = x1_range, group = unique(sleepstudy$group) )
new.data2$pred <- predict(fm2, new.data2, allow.new.levels=TRUE, re.form=NA)
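This is presumably also why the original wireframe() call was blank: new.df is not a regular grid in Days and x1, and wireframe() expects gridded data. With the gridded new.data2 it should have a full surface to draw; a quick sketch (drape is just cosmetic):
library(lattice)
# Gridded data gives wireframe() a complete Days-by-x1 surface
wireframe(pred ~ Days * x1, data = new.data2[new.data2$group == 1, ],
          xlab = "Days", ylab = "x1", zlab = "Predicted fit", drape = TRUE)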
I can then stuff those into two different matrices to represent the z-surface for each group in my model:
surf1 <- matrix(new.data2[new.data2$group == 1, ]$pred, nrow = length(days), ncol = length(x1_range))
surf2 <- matrix(new.data2[new.data2$group == 2, ]$pred, nrow = length(days), ncol = length(x1_range))
group <- c(rep(1, nrow(surf1)), rep(2, nrow(surf2)))
Finally I can use plot_ly to plot each surface:
library(plotly)
plot_ly(z = surf1, x = x1_range, y = days, type = "surface") %>%
  add_surface(z = surf2, surfacecolor = surf2,
              color = c('red','yellow'))
The resulting plot:
So the resulting plot is what I wanted (albeit not very useful in this made-up example, but useful with real data). The only thing I can't figure out is how to show two different color scales. I can suppress the scale altogether, but if anyone knows how to show two scales for different surfaces, please do let me know and I will edit the answer.
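One possible route (a sketch I haven't verified against this exact data): each surface trace in plotly can carry its own colorscale and colorbar, and the colorbars can be offset so they don't overlap. The colorbar titles and x positions below are illustrative guesses:
library(plotly)
# Give each surface its own colorscale and an offset colorbar
plot_ly() %>%
  add_surface(z = surf1, x = x1_range, y = days,
              colorscale = "Viridis",
              colorbar = list(title = "Group 1", x = 1.02)) %>%
  add_surface(z = surf2, x = x1_range, y = days,
              colorscale = list(c(0, "red"), c(1, "yellow")),
              colorbar = list(title = "Group 2", x = 1.15))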

Related

how to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns

Is it possible to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns?
The reason is that I have a variable y which depends on x, and I am sure there is a positive correlation, although my dataset contains some noise. This is why I do not want a wiggly line for x between 1 and 75, as in the graph below.
library(ggplot2)
library(dplyr)
x <- c(1:100)
y <- c(1,3,6,12,22,15,13,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,5,9,10,1,6,12,22,15,13,11,5,-1,-12,-23,6,12,22,11,5,1,3,6,12,22,11,5,-11,-22,-9,12,22,11,5,9,10,18,1,3,6,12,22,15,13,11,5,-5, -9, -12,6,12,22,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,9,10,18,28,37,50,90,120,150,200)
y <- y + x
df <- data.frame(x, y)
model <- lm(y ~ poly(x, 6, raw = TRUE), data = df)
predictions <- model %>% predict(df)
df <- cbind(df, predictions)
ggplot() +
  geom_point(data = df, aes(x = x, y = y), size = 0.1) +
  geom_line(data = df, aes(x = x, y = predictions), colour = "blue", size = 0.1)
I can alter the model to:
model <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
Which gives this graph:
In this case the model has no wiggliness for x between 0 and 90, although it lacks the flexibility to make the turn around x = 90.
I am not looking for a specific solution for this example dataset. I am looking for a way to make a polynomial regression flexible enough to make sharp (big) turns while at the same time reducing wiggliness for small turns. (Maybe this could be solved by limiting the fit to a maximum of n turns?)
I want to apply this automatically to several datasets, so I do not want to specify different ranges of x for different models.
Thank you!
I have also tried using gam from mgcv, although this gives similar results:
library(mgcv)
mygam <- gam(y ~ s(x, k = 7), data = df)
mygam <- gam(y ~ s(x, k = 3), data = df)
This graph is based on pmax(p1, p2) where p1 and p2 are two polynomials:
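The exact polynomials behind that graph aren't given; a sketch of the idea, assuming for illustration the degree-2 and degree-6 fits from above:
# Pointwise maximum of a stiff and a flexible polynomial fit
p1 <- predict(lm(y ~ poly(x, 2, raw = TRUE), data = df))
p2 <- predict(lm(y ~ poly(x, 6, raw = TRUE), data = df))
df$pmax_fit <- pmax(p1, p2)
ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 0.1) +
  geom_line(aes(y = pmax_fit), colour = "blue")
Another option that matches the stated goal (stiff where the data are merely noisy, flexible where they genuinely turn) is an adaptive smooth in mgcv, where the wiggliness penalty itself varies along x; a hedged sketch (k = 20 is a guess):
library(mgcv)
# bs = "ad" lets the penalty vary along x, so the fit can stay stiff over
# the noisy stretch and still bend sharply near x = 90
mygam_ad <- gam(y ~ s(x, bs = "ad", k = 20), data = df, method = "REML")
df$pred_ad <- predict(mygam_ad)
ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 0.1) +
  geom_line(aes(y = pred_ad), colour = "blue")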

How do I add the curve from GauPro in ggplot?

GauPro is an R library for fitting Gaussian processes. You can also get it to produce a nice predicted curve for you.
The documentation for GauPro uses built-in R plotting functions to do plots like this:
gp <- GauPro(x,y) ## fit a gaussian process model to x & y
plot(x,y) ## plots the x,y points
curve(gp$predict(x), add=T, col=2) ## adds the predicted curve from the gaussian process
What would be the equivalent using ggplot? I can get the points to show up, but I can't quite figure out how to add the curve.
GauPro documentation I refer to is here
We can do this by building a little data frame of predictions. Let's start by loading the necessary packages and creating some sample data:
library(GauPro)
library(ggplot2)
set.seed(69)
x <- 1:10
y <- cumsum(runif(10))
Now we can create our model and plot it using the same plotting functions shown in the vignette you linked:
gp <- GauPro(x, y)
plot(x, y)
curve(gp$predict(x), add = TRUE, col = 2)
Now if we want to customize this plot using ggplot, we need a data frame with columns for the x values at which we wish to predict, the y prediction at that point, and a column each for upper and lower 95% confidence intervals. We can obtain the x values like this:
new_x <- seq(min(x), max(x), length.out = 100)
and we can get the three sets of corresponding y values using predict like this:
predict_df <- predict(gp, new_x, se.fit = TRUE)
predict_df$x <- new_x
predict_df$y <- predict_df$mean
predict_df$lower <- predict_df$y - 1.96 * predict_df$se
predict_df$upper <- predict_df$y + 1.96 * predict_df$se
This is now quite straightforward to plot in ggplot, with themes customized as you choose:
ggplot(data.frame(x, y), aes(x, y)) +
  geom_point() +
  geom_line(data = predict_df, color = "deepskyblue4", linetype = 2) +
  geom_ribbon(data = predict_df, aes(ymin = lower, ymax = upper),
              alpha = 0.2, fill = "deepskyblue4") +
  theme_minimal()
Created on 2020-07-29 by the reprex package (v0.3.0)

Difference between two geom_smooth() lines

I made a plot for my data, and now I would like to have the difference in y, for every x, between the two lines estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how to get the differences for the following plot (data below):
EDIT
Two suggestions were made but I still don't know how to calculate the differences.
The first suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
pb[["data"]][[1]]
That approach kind of works, but the data doesn't use the same x values for the two groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group; hence, I cannot calculate the difference in y at -3.21318853 between the two groups.
The second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000; hence, gam is used by default. I am not familiar with gam; can anyone give a short explanation of how to calculate the difference between the two lines, given what has just been described?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x = rnorm(n), g = factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g == 0] <- df$x[df$g == 0]**2 + rnorm(sum(df$g == 0))*5 # y for group g = 0
df$y[df$g == 1] <- 2 + df$x[df$g == 1]**2 + rnorm(sum(df$g == 1))*5 # y for g = 1 (with intercept 2)
ggplot(df, aes(x, y, col = g)) + geom_smooth() + geom_point(alpha = .1) # make a plot
Hi, and welcome to Stack Overflow.
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x = rnorm(n), g = factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g == 0] <- df$x[df$g == 0]**2 + rnorm(sum(df$g == 0))*5 # y for group g = 0
df$y[df$g == 1] <- 2 + df$x[df$g == 1]**2 + rnorm(sum(df$g == 1))*5 # y for g = 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col = g)) + geom_smooth() + geom_point(alpha = .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
Output:
As I mentioned in the comments above, you really are better off doing this outside of ggplot and instead do it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x = rnorm(n), g = factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g == 0] <- df$x[df$g == 0]**2 + rnorm(sum(df$g == 0))*5 # y for group g = 0
df$y[df$g == 1] <- 2 + df$x[df$g == 1]**2 + rnorm(sum(df$g == 1))*5 # y for g = 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit). For this model we also need to include g as a parametric effect: the group-specific smooths are both centred about 0, so the group means have to enter through the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
                             g = c(0,1)))
pdat <- transform(pdat, g = factor(g))
Then we use this prediction data to generate the Xp matrix, a matrix that maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next, some code to identify which rows and columns in xp belong to the smooths for the respective levels of g. As there are only two levels and only a single smooth term in the model, this is entirely trivial here, but for more complex models it is needed, and it is important to get the smooth component names right for the grepl() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing:
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for other groups
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here with the vcov() call we're using the empirical Bayesian covariance matrix but not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via argument unconditional = TRUE.
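If you want that correction here, only the vcov() call changes; a minimal sketch:
## se of difference, also accounting for smoothness parameter uncertainty
se2 <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))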
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
                  dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
  geom_line()
This produces
This is consistent with an assessment showing that the model with group-level smooths doesn't provide a substantially better fit than a model with different group means but only a single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
r$> AIC(m0, m)
df AIC
m0 9.68355 30277.93
m 14.70675 30285.02
r$> anova(m0, m, test = 'F')
Analysis of Deviance Table
Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 4990.1 124372
2 4983.9 124298 6.1762 73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
                        unconditional = FALSE) {
  xp <- predict(model, newdata = newdata, type = 'lpmatrix')
  c1 <- grepl(f1, colnames(xp))
  c2 <- grepl(f2, colnames(xp))
  r1 <- newdata[[var]] == f1
  r2 <- newdata[[var]] == f2
  ## difference rows of xp for data from comparison
  X <- xp[r1, ] - xp[r2, ]
  ## zero out cols of X related to splines for other groups
  X[, ! (c1 | c2)] <- 0
  ## zero out the parametric cols
  X[, !grepl('^s\\(', colnames(xp))] <- 0
  dif <- X %*% coef(model)
  se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
  crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
  upr <- dif + (crit * se)
  lwr <- dif - (crit * se)
  data.frame(pair = paste(f1, f2, sep = '-'),
             diff = dif,
             se = se,
             upper = upr,
             lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
             out)
ggplot(out, aes(x = x, y = diff)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
  geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

`scatterplot3d`: can not add a regression plane to 3D scatter plot

I have created a 3d Scatterplot in R and want to add a regression plane. I have looked at code from the statmethods.net website, which can be very useful, and it worked. I then tried it with my own data and the plane did not show up.
library(scatterplot3d)
s3d <- scatterplot3d(Try$Visits, Try$Net.Spend, Try$Radio, pch = 16,
                     highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
fit <- lm(Try$Visits ~ Try$Net.Spend +Try$Radio)
s3d$plane3d(fit)
I cannot reproduce the issue with the following reproducible example:
set.seed(0)
x <- runif(20)
y <- runif(20)
z <- 0.1 + 0.3 * x + 0.5 * y + rnorm(20, sd = 0.1)
dat <- data.frame(x, y, z)
rm(x,y,z)
fit <- lm(z ~ x + y, data = dat)
library(scatterplot3d)
s3d <- scatterplot3d(dat$x, dat$y, dat$z, pch = 16,
                     highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
s3d$plane3d(fit)
You should avoid $ in model formulas. Use the data argument instead:
fit <- lm(Visits ~ Net.Spend + Radio, data = Try)
Your z variable (the dependent variable) in the scatter plot is Try$Radio, whereas in the regression model the dependent variable is Try$Visits, and this mismatch is causing the confusion: the third variable in the scatterplot3d() call is treated as the dependent variable in R.
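Putting both fixes together, a sketch of the corrected code (assuming Visits is the intended response, so it goes in the third, z-axis, position to match the model):
library(scatterplot3d)
fit <- lm(Visits ~ Net.Spend + Radio, data = Try)
# z (third argument) is now Visits, matching the response in fit
s3d <- scatterplot3d(Try$Net.Spend, Try$Radio, Try$Visits, pch = 16,
                     highlight.3d = TRUE, type = "h", main = "3D Scatterplot")
s3d$plane3d(fit)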

Custom ggplot2 shaded error areas on categorical line plot

I'm trying to plot a line smoothed by loess, and I'm trying to figure out how to include shaded error areas whose bounds are defined by existing variables and are themselves smoothed.
This code creates example data:
set.seed(12345)
data <- cbind(rep("A", 100), rnorm(100, 0, 1))
data <- rbind(data, cbind(rep("B", 100), rnorm(100, 5, 1)))
data <- rbind(data, cbind(rep("C", 100), rnorm(100, 10, 1)))
data <- rbind(data, cbind(rep("D", 100), rnorm(100, 15, 1)))
data <- cbind(rep(1:100, 4), data)
data <- data.frame(data)
names(data) <- c("num", "category", "value")
data$num <- as.numeric(data$num)
data$value <- as.numeric(data$value)
data$upper <- data$value+0.20
data$lower <- data$value-0.30
Plotting the data below, this is what I get:
ggplot(data, aes(x=num, y=value, colour=category)) +
  stat_smooth(method="loess", se=F)
What I'd like is a plot that looks like the following, except with the upper and lower bounds of the shaded areas being bounded by smoothed lines of the "upper" and "lower" variables in the generated data.
Any help would be greatly appreciated.
Here's one way to add smoothed versions of upper and lower. We'll add LOESS predictions for upper and lower to the data frame and then plot those using geom_ribbon. It would be more elegant if this could all be done within the call to ggplot. That's probably possible by feeding a special-purpose function to stat_summary, and hopefully someone else will post an answer using that approach.
# Expand the scale of the upper and lower values so that the difference
# is visible in the plot
data$upper = data$value + 10
data$lower = data$value - 10
# Order data by category and num
data = data[order(data$category, data$num),]
# Create LOESS predictions for the values of upper and lower
# and add them to the data frame. I'm sure there's a better way to do this,
# but my attempts with dplyr and tapply both failed, so I've resorted to the clunky
# method below.
data$upperLoess = unlist(lapply(LETTERS[1:4],
  function(x) predict(loess(data$upper[data$category==x] ~
                            data$num[data$category==x]))))
data$lowerLoess = unlist(lapply(LETTERS[1:4],
  function(x) predict(loess(data$lower[data$category==x] ~
                            data$num[data$category==x]))))
# Use geom_ribbon to add a prediction band bounded by the LOESS predictions for
# upper and lower
ggplot(data, aes(num, value, colour=category, fill=category)) +
  geom_smooth(method="loess", se=FALSE) +
  geom_ribbon(aes(x=num, y=value, ymax=upperLoess, ymin=lowerLoess),
              alpha=0.2)
And here's the result:
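As an aside, the "clunky" lapply step above can be written as a grouped mutate; a dplyr sketch (untested against this exact data, but the idiom is standard):
library(dplyr)
# Fit a per-category LOESS to upper and lower inside a grouped mutate
data <- data %>%
  group_by(category) %>%
  mutate(upperLoess = predict(loess(upper ~ num)),
         lowerLoess = predict(loess(lower ~ num))) %>%
  ungroup()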
