Obtaining predictions from an mgcv::gam fit that contains a matrix "by" variable to a smooth

I just discovered that mgcv::s() permits supplying a matrix to its by argument, making it possible to smooth a continuous variable with separate smooths for each of a combination of variables (and their interactions, if desired). However, I'm having trouble getting sensible predictions from such models. For example:
library(mgcv) #for gam
library(ggplot2) #for plotting
#Generate some fake data
set.seed(1) #for replicability of this example
myData = expand.grid(
  var1 = c(-1, 1)
  , var2 = c(-1, 1)
  , z = -10:10
)
myData$y = rnorm(nrow(myData)) + (myData$z^2 + myData$z*4) * myData$var1 +
  (3*myData$z^2 + myData$z) * myData$var2
#note additive effects of var1 and var2
#plot the data
ggplot(
  data = myData
  , mapping = aes(
    x = z
    , y = y
    , colour = factor(var1)
    , linetype = factor(var2)
  )
) +
  geom_line(
    alpha = .5
  )
#reformat to matrices
zMat = matrix(rep(myData$z,times=2),ncol=2)
xMat = matrix(c(myData$var1,myData$var2),ncol=2)
#get the fit
fit = gam(
  formula = myData$y ~ s(zMat, by = xMat, k = 5)
)
#get the predictions and plot them
predicted = myData
predicted$value = predict(fit)
ggplot(
  data = predicted
  , mapping = aes(
    x = z
    , y = value
    , colour = factor(var1)
    , linetype = factor(var2)
  )
) +
  geom_line(
    alpha = .5
  )
Yields this plot of the input data:
And this obviously awry plot of the predicted values:
Whereas replacing the gam fit above with:
fit = gam(
  formula = y ~ s(z, by = var1, k = 5) + s(z, by = var2, k = 5)
  , data = myData
)
but otherwise running the same code yields this reasonable plot of predicted values:
What am I doing wrong here?

The use of matrix-valued inputs to mgcv smooths is taken up in the mgcv documentation on linear functional terms (?linear.functional.terms). It seems to me that you are misunderstanding these model types.
Your first formula
myData$y ~ s(zMat, by = xMat, k = 5)
fits the model
y ~ f(z)*x_1 + f(z)*x_2
That is, mgcv estimates a single smooth function f(). The function is evaluated at each column of the matrix covariate, each evaluation is weighted by the corresponding column of the matrix supplied to the by argument, and the results are summed (mgcv's "summation convention").
Your second formula
y ~ s(z, by = var1, k = 5) + s(z, by = var2, k = 5)
fits the model
y ~ f_1(z)*x_1 + f_2(z)*x_2
where f_1() and f_2() are two different smooth functions. Your data-generating process is essentially the second formula, so it is not surprising that it gives a more sensible-looking fit.
The first formula is useful when you want an additive model where a single function is evaluated on each variable, with given weightings.
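To see the distinction in code, here is a minimal sketch (reusing myData, zMat, and xMat from the question) that fits both forms; the first summary shows a single smooth term, the second shows two:
#one shared smooth f(), weighted by the columns of xMat (mgcv's summation convention)
fitShared = gam(myData$y ~ s(zMat, by = xMat, k = 5))
#two separate smooths f_1() and f_2()
fitSeparate = gam(y ~ s(z, by = var1, k = 5) + s(z, by = var2, k = 5), data = myData)
summary(fitShared) #one s() term
summary(fitSeparate) #two s() terms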

Related

Unable to plot confidence intervals using ggplot (geom_ribbon() argument)

I am trying to plot 95% confidence intervals on some simulated values but am running into some issues when I try to plot the CIs using geom_ribbon(). The trouble I'm having is that the plot does not show the CIs, like so:
I have included all of my code below if anyone can see where I have gone wrong:
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + .1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
model2 <- glm(y ~ x,
              data = df,
              family = poisson(link = 'log'))
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#transform the CI limit to get one at the level of the mean
upper_mod2 = exp(upper_mod2)/(1+exp(upper_mod2))
lower_mod2 = exp(lower_mod2)/(1+exp(lower_mod2))
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')
In a comment on the question, it was asked why not logit-transform the predicted values. The reason is that the type of prediction requested is "response". From the documentation, my emphasis:
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
The best way to answer is to show the code.
library(ggplot2, quietly = TRUE)
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + 0.1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
suppressWarnings(
  model2 <- glm(y ~ x,
                data = df,
                family = poisson(link = 'log'))
)
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')
Created on 2022-05-29 by the reprex package (v2.0.1)
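One refinement worth noting (not part of the original answer): for a log-link GLM, the interval is usually constructed on the link scale and then back-transformed with exp(), which guarantees a positive lower limit. A sketch reusing model2 and df from above:
#predictions and standard errors on the linear-predictor (log) scale
pred_link <- predict(model2, df, se.fit = TRUE, type = "link")
#back-transform the interval limits to the response scale
predframe2 <- data.frame(x = df$x, y = df$y,
                         lwr = exp(pred_link$fit - 1.96 * pred_link$se.fit),
                         upr = exp(pred_link$fit + 1.96 * pred_link$se.fit))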

How to convert log function in RStudio?

fit1 = lm(price ~ . , data = car)
fit2 = lm(log(price) ~ . , data = car)
I'm not sure how to convert log(price) back to price in fit2. Won't it just become the same thing as fit1 if I do convert it? Please help.
Let's take a very simple example. Suppose I have some data points like this:
library(ggplot2)
df <- data.frame(x = 1:10, y = (1:10)^2)
(p <- ggplot(df, aes(x, y)) + geom_point())
I want to try to fit a model to them, but don't know what form this should take. I try a linear regression first and plot the resultant prediction:
mod1 <- lm(y ~ x, data = df)
(p <- p + geom_line(aes(y = predict(mod1)), color = "blue"))
Next I try a linear regression on log(y). Whatever results I get from predictions from this model will be predicted values of log(y). But I don't want log(y) predictions, I want y predictions, so I need to take the 'anti-log' of the prediction. We do this in R with exp():
mod2 <- lm(log(y) ~ x, data = df)
(p <- p + geom_line(aes(y = exp(predict(mod2))), color = "red"))
But we can see that we have different regression lines. That's because when we took the log of y, we were effectively fitting a straight line on the plot of log(y) against x. When we transform the axis back to a non-log axis, our straight line becomes an exponential curve. We can see this more clearly by drawing our plot again with a log-transformed y axis:
p + scale_y_log10(limits = c(1, 500))
Created on 2020-08-04 by the reprex package (v0.3.0)
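Applied to the model in the question, the same back-transformation would be (a sketch, assuming the asker's car data frame):
fit2 = lm(log(price) ~ . , data = car)
exp(predict(fit2)) # predictions on the price scale
One caveat: exp() of a log-scale prediction estimates the median of price rather than its mean; if the mean is needed, a bias correction such as Duan's smearing estimator can be applied.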

How do I plot a single numerical covariate using emmeans (or other package) from a model?

After variable selection I usually end up with a model containing a numerical covariate (2nd or 3rd degree polynomial). What I want to do is plot it, preferably using the emmeans package. Is there a way of doing that?
I can do it using predict:
m1 <- lm(mpg ~ poly(disp,2), data = mtcars)
df <- cbind(disp = mtcars$disp, predict.lm(m1, interval = "confidence"))
df <- as.data.frame(df)
ggplot(data = df, aes(x = disp, y = fit)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = disp, y = fit), alpha = 0.2)
I haven't figured out a way of doing it using either emmip or emtrends.
For illustration purposes, how could I do it using mixed models via lme?
library(nlme) #lme() comes from nlme
m1 <- lme(mpg ~ poly(disp, 2), random = ~ 1 | factor(am), data = mtcars)
I suspect that your issue is due to the fact that, by default, covariates are reduced to their means in emmeans. You can use the at or cov.reduce arguments to specify a larger number of values. See the documentation for ref_grid and vignette("basics", "emmeans"), or the index of vignette topics.
Using sjPlot:
plot_model(m1, terms = "disp [all]", type = "pred")
gives the same graphic.
Using emmeans:
em1 <- ref_grid(m1, at = list(disp = seq(min(mtcars$disp), max(mtcars$disp), 1)))
emmip(em1, ~disp, CIs = T)
returns a graphic with a small difference in layout. An alternative is to save the result to an object and plot it the way I want, as shown below:
d1 <- emmip(em1, ~disp, CIs = T, plotit = F)
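Since plotit = F returns the plotting data rather than the graphic, d1 can then be passed straight to ggplot. A sketch, assuming the returned data frame carries columns named yvar, LCL, and UCL (as emmip() produces when CIs = T):
ggplot(d1, aes(x = disp, y = yvar)) +
  geom_line() +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.2)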

brms package in R smoother

I have this data frame in R:
x = rep(seq(-10,10,1), each = 5)
y = rep(0, length(x))
weights = sample(seq(1,20,1), length(x), replace = TRUE)
weights = weights/sum(weights)
groups = rep(letters[1:5], times = length(x)/5)
#assemble into the data frame used by the code below
dat = data.frame(x = x, y = y, weights = weights, group = groups)
and some data that looks like this:
library(ggplot2)
ggplot(data = dat, aes(x = x, y = y, color = group)) +
  geom_point(aes(size = weights)) +
  ylab("outcome") +
  xlab("predictor x1") +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
library(brms) #for brm()
fit_brms = brm(y ~ s(x) + (1|group), data = dat)
by_group = marginal_effects(fit_brms, conditions = data.frame(group = dat$group),
                            re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)
I'd like to fit a hierarchical nonlinear model so that there is a different nonlinear fit for each group. In brms I have the code above, which does a spline fit on the x predictor with random intercepts on group; the fitted line is the same for all groups, and the only difference is where the lines cross the y-intercept. Is there a way to make the nonlinear fit differ for each group's data points?
On page 13 here: https://cran.r-project.org/web/packages/brms/vignettes/brms_multilevel.pdf
it states: "As the smooth term itself cannot be modeled as varying by year in a multilevel manner, we add a basic varying intercept in an effort to account for variation between years."
So it appears the spline will be the same for all groups, and the only difference in the plots is where the spline crosses the y-intercept. That seems very restrictive. Can this be modified to make the spline unique to each group?
Use the formula: y ~ s(x, by = group) + (1|group)
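A sketch of the full call (assuming dat from the question, with group coded as a factor): s(x, by = group) estimates a separate smooth of x for each level of group, while (1|group) keeps the varying intercepts:
library(brms)
dat$group = factor(dat$group)
fit_by_group = brm(y ~ s(x, by = group) + (1|group), data = dat)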

R: GAM with fit on subset of data

I fit a Generalized Additive Model using gam from the mgcv package. I have a data table containing my dependent variable Y, an independent variable X, other independent variables Oth and a two-level factor Fac. I would like to fit the following model
Y ~ s(X) + Oth
BUT with the additional constraint that the s(X) term is fit only on one of the two levels of the factor, say Fac==1. The other terms Oth should be fit with the whole data.
I tried exploring s(X,by=Fac) but this biases the fit for Oth. In other words, I would like to express the belief that X relates to Y only if Fac==1, otherwise it does not make sense to model X.
Cheap trick: use an auxiliary variable that is X if Fac == 1 and 0 elsewhere.
library("mgcv")
library("ggplot2")
# simulate data
N <- 1e3
dat <- data.frame(covariate = runif(N),
                  predictor = runif(N),
                  group = factor(sample(0:1, N, TRUE)))
dat$outcome <- rnorm(N,
                     1 * dat$covariate +
                       ifelse(dat$group == 1,
                              .5 * dat$predictor +
                                1.5 * sin(dat$predictor * pi),
                              0),
                     .1)
# some plots
ggplot(dat, aes(x = predictor, y = outcome,
                col = group, group = group)) +
  geom_point()
ggplot(dat, aes(x = covariate, y = outcome,
                col = group, group = group)) +
  geom_point()
# create auxiliary variable
dat$aux <- ifelse(dat$group == 1,
                  dat$predictor,
                  0)
# fit the data
fit1 <- gam(outcome ~ covariate + s(predictor, by = group),
            data = dat)
fit2 <- gam(outcome ~ covariate + s(aux, by = group),
            data = dat)
# compare fits
summary(fit1)
summary(fit2)
If I understand it right, you're thinking about a model with an interaction like this:
Y ~ Oth + (Fac==1)*s(X)
If you want to "express the belief that X relates to Y only if Fac==1", don't treat Fac as a factor but as a numeric variable. In this case you will get a numeric interaction and only one set of coefficients (when it's a factor there were two). This type of model is called a varying-coefficient model.
# some data
data <- data.frame(th = runif(100),
                   X = runif(100),
                   Y = runif(100),
                   Fac = sample(0:1, 100, TRUE))
data$Fac <- as.numeric(as.character(data$Fac)) #change to numeric
# then run the model
gam(Y ~ s(X, by = Fac) + th, data = data)
See the documentation for the by argument in ?s.
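As a quick sanity check of the varying-coefficient behaviour (a sketch using the simulated data above): for rows with Fac == 0 the by weight zeroes out the smooth, so those predictions depend only on th:
m <- gam(Y ~ s(X, by = Fac) + th, data = data)
# the s(X):Fac term contributes 0 wherever Fac == 0
head(predict(m, type = "terms")[data$Fac == 0, ])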
