How to predict a GAM model with a random effect in R?

I am trying to predict from a GAM model with a random effect in order to produce a 3D surface plot with plot_ly.
Here is my code:
library(mgcv)

x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1, times = 100) # random effect
r[51:100] <- 2           # replace 1 with 2, making two groups
df <- data.frame(x, y, z, r)
gam_fit <- gam(z ~ s(x) + s(y) + s(r, bs = "re"), data = df) # fit
# create matrix data for the `add_surface` function in `plot_ly`
newx <- seq(0, 1, len = 20)
newy <- seq(0, 1, len = 30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as a matrix
However, the last line results in an error:
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
not all required variables have been supplied in newdata!
Thanks to a previous answer, I am sure that the above code works without the random effect, as shown here.
How can I predict gam models with random effect?

Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.
The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.
Option 1:
newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
z1 <- matrix(z1, 20, 30)
Option 2:
z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
              newdata.guaranteed = TRUE)
z2 <- matrix(z2, 20, 30)
These produce the same result:
> all.equal(z1, z2)
[1] TRUE
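To close the loop with plot_ly, a minimal sketch (assuming the plotly package is loaded; add_surface() reads the rows of z along the y axis, hence the transpose):
library(plotly)
plot_ly(x = newx, y = newy, z = t(z1)) %>% add_surface()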
A couple of notes:
Which you use will depend on how complex the rest of your model is. I would generally use the first option, as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates, it seems safe enough to trust that newdata is OK.
Your example uses a random slope (was that intended?), not a random intercept, as r is not a factor. If your real example uses a factor random effect, then you'll need to be a little more careful when creating the newdata, as you need to get the levels of the factor right. For example:
expand.grid(x = newx, y = newy,
            r = with(df, factor(2, levels = levels(r))))
should get the right set-up for a factor r.
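Putting that into practice, a sketch (the new object names are just for illustration) that refits the toy model with r as a factor, i.e. a random intercept, and then predicts:
df$r <- factor(df$r)
gam_fit2 <- gam(z ~ s(x) + s(y) + s(r, bs = "re"), data = df)
newxy2 <- expand.grid(x = newx, y = newy,
                      r = factor(2, levels = levels(df$r)))
z3 <- matrix(predict(gam_fit2, newdata = newxy2, exclude = "s(r)"), 20, 30)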

Related

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model with a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete = T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata, and then mgcv looking for random.x and finding it in the global environment. You should really gather those variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.
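A sketch of that tidier setup, reusing the objects from the example above (newdat being the workaround version with a random.x column):
dat <- data.frame(y = y, x = x, random.x = random.x)
fit3 <- bam(y ~ s(x) + s(random.x, bs = "re"),
            data = dat, discrete = TRUE, nthreads = 2)
test3 <- predict(fit3, newdata = newdat, exclude = "s(random.x)")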

Fit a GEE-model of type "exchangeable" with gamm

I would like to estimate a smooth effect of some covariate N in a marginal model of type "exchangeable" in R, where the clustering variable is S. From what I could find, this should be possible with:
geeglm(..., id = S, corstr = "exchangeable")
as well as:
gamm(..., correlation = corCompSymm(form = ~1|S))
Below you can find an example where the results look good, in the sense that the two estimates are quite close. However, if I use the real data our project is about, the estimated smooth effects tend to be very different. I cannot publish that here, but maybe someone can still spot a problem in the code. For instance (see below), the gamm object says Number of Groups: 1, which worries me, as there clearly is more than one cluster...
(Yes, this is the realisation of a random-effects-model by construction, but this should lead to the desired model given the answer here.)
########
## Packages
########
library(ggplot2)
library(mgcv)
library(dplyr)
library(geepack)
library(splines)
########
## Data Simulation
########
f <- function(N) {return((-200+(N-25)^2)/100)}
N <- sort(sample(1:50, 10, replace = T))
S <- as.character(1:10)
S_Effect <- rnorm(length(S),0,1)
S_Effect <- rep(S_Effect,N)
S <- rep(S,N)
N <- rep(N,N)
E <- runif(length(N))
data <- data.frame(O = rep(0, length(N)),
                   E = E,
                   N = N,
                   S = as.factor(S),
                   S_Effect = S_Effect)
for (i in 1:length(N)) {
  data$O[i] <- rbinom(1, 1, plogis(f(N[i]) + qlogis(E[i]) + S_Effect[i]))
}
data <- data %>% mutate(E = qlogis(E))
########
## Fitting
########
formula_gamm <- as.formula("O ~ 1 + offset(E) + s(N, bs = 'bs')")
model_gamm <- gamm(formula_gamm, family = binomial(), correlation = corCompSymm(form=~1|S), data = data)
model_gamm
formula_geeglm <- as.formula("O ~ 1 + offset(E) + bs(N)")
model_geeglm <- geeglm(formula_geeglm, family = binomial(), corstr = "exchangeable", id = S, data = data)
########
## Plot
########
pred_gamm <- plot.gam(model_gamm$gam, select = 1)
x <- pred_gamm[[1]]$x
pred_geeglm <- predict(model_geeglm, type = "terms", newdata = data.frame(E = rep(0,length(x)), N = x))
z <- qnorm(0.9)
tmp <- data.frame(x = x,
                  y = pred_gamm[[1]]$fit,
                  group = rep("estimate gamm", length(x)))
tmp2 <- data.frame(x = x,
                   y = as.numeric(pred_geeglm),
                   group = rep("estimate geeglm", length(x)))
tmp3 <- data.frame(x = x,
                   y = f(x),
                   group = rep("actual function", length(x)))
data_pred <- bind_rows(tmp, tmp2, tmp3) %>% mutate(group = as.factor(group))
p <- ggplot(data = data_pred, aes(x = x, y = y, color = group)) +
  geom_line(size = 2) +
  xlab("N") +
  ylab("f(N)")
p
An additional question: the gamm object contains enough information to plot a confidence band around the estimated function, but how can I do this for the geeglm estimate? You get something that looks reasonable if you simulate(model_geeglm, ...) and take the pointwise mean and so on, but that doesn't really satisfy me, as (1) the documentation on simulate doesn't mention marginal models and (2) it is very primitive...
The GAMM is using penalised splines, such that the degrees of freedom used by the resulting spline (smoother) are likely to be somewhat less than the requested basis dimension, which is 10. The GEE is fitting an unpenalised model. All else equal, the unpenalised model will be more wiggly than the penalised one.
To compare these approaches on a common footing, you need to make sure that bs() and s(x, bs = 'bs') both produce the same number of basis functions. (The s() version can produce one fewer, as it will remove the lack of identifiability with the intercept term, whereas you are omitting the intercept in the bs() version.)
Having assured yourself that you get the same basis dimension, you can then make gamm() fit an unpenalised spline by adding fx = TRUE to the s(...) term in the formula.
Having done that, both models should be estimating similar smooth effects.
However, I would suggest that you use penalisation: for the GAMM model, use fx = FALSE, and then after estimating the model run gam.check(model$gam) (replacing model with your fitted model object) and see if the basis size check passes for the smoother.
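For concreteness, a minimal sketch of both suggestions; k = 4 is an assumption here, chosen so that after the identifiability constraint s() is left with three basis functions, matching the default bs(N):
# unpenalised fit, comparable to the GEE (k = 4 is an assumed choice, see above)
model_gamm_fx <- gamm(O ~ 1 + offset(E) + s(N, bs = "bs", k = 4, fx = TRUE),
                      family = binomial(),
                      correlation = corCompSymm(form = ~ 1 | S), data = data)
# with penalisation (the default fx = FALSE), check the basis size afterwards
gam.check(model_gamm$gam)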

R language, nonlinear model formula predict

I fit an exponential formula to a set of data (x, y). Then I want to calculate the y values from the formula for x values beyond the actual data set. It doesn't work; it always prints the y values for the actual x values. Here is the code. What have I done wrong, and what is the solution for my task in R?
data <- data.frame(x=seq(1,69), y=othertable[1:69, 2])
nlsxypw <- nls(data$y ~ a*data$x^b, col2_60, start=list(a=2200000, b=0))
predict(nlsxypw)
#here I want to calculate the y values for x = 70-80
xnew <- seq(70, 80, 1)
predict(nlsxypw, xnew)
#it doesn't print these values, still the actual values for x=1~69.
This is kind of a strange feature of predict.nls (possibly other predict methods as well?), but you have to supply the new data with the same names your model was defined in terms of:
set.seed(123)
Data <- data.frame(
  x = 1:69,
  y = ((1:69)^2) + rnorm(69, 0, 5))
nlsxypw <- nls(y ~ a * (x^b),
               data = Data,
               start = list(a = 2.5, b = 1))
##
xnew <- 70:80
## note how newdata is specified
y.pred <- predict(nlsxypw, newdata=list(x=xnew))
> y.pred
[1] 4900.355 5041.359 5184.364 5329.368 5476.373 5625.377 5776.381 5929.386 6084.390 6241.393 6400.397
##
with(Data,
     plot(x, y, pch = 20,
          xlim = c(0, 90),
          ylim = c(0, 6700)))
lines(fitted(nlsxypw), col = "red")
points(x = xnew,
       y = y.pred,
       pch = 20,
       col = "blue")
##

Adding error variance to output of predict()

I am attempting to take a linear model fitted to empirical data, e.g.:
set.seed(1)
x <- seq(from = 0, to = 1, by = .01)
y <- x + .25*rnorm(101)
model <- (lm(y ~ x))
summary(model)
# R^2 is .6208
Now, what I would like to do is use the predict function (or something similar) to create, from x, a vector y of predicted values that shares the error of the original relationship between x and y. Using predict alone gives perfectly fitted values, so R^2 is 1, e.g.:
y2 <- predict(model)
summary(lm(y2 ~ x))
# R^2 is 1
I know that I can use predict(model, se.fit = TRUE) to get the standard errors of the prediction, but I haven't found an option to incorporate those into the prediction itself, nor do I know exactly how to incorporate these standard errors into the predicted values to give the correct amount of error.
Hopefully someone here can point me in the right direction!
How about simulate(model)?
set.seed(1)
x <- seq(from = 0, to = 1, by = .01)
y <- x + .25*rnorm(101)
model <- (lm(y ~ x))
y2 <- predict(model)
y3 <- simulate(model)
matplot(x,cbind(y,y2,y3),pch=1,col=1:3)
If you need to do it by hand you could use
y4 <- rnorm(nobs(model), mean = predict(model),
            sd = summary(model)$sigma)
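As a quick sanity check (the exact value varies with the seed), the regenerated y4 should give back roughly the original R^2 of 0.62 rather than 1:
summary(lm(y4 ~ x))$r.squared # about 0.62, not the perfect 1 from predict()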

project a linear regression hyper plane to a 2d plot (abline-like)

I have this code
factors<-read.csv("India_Factors.csv",header=TRUE)
marketfactor<-factors[,4]
sizefactor<-factors[,5]
valuefactor<-factors[,6]
dati<-get.hist.quote("SI", quote = "AdjClose", compression = "m")
returns<-diff(dati)
regression<-lm(returns ~ marketfactor + sizefactor + valuefactor,na.action=na.omit)
which does multiple linear regression.
I want to plot on a 2D plane the returns against one factor (this part is trivial, of course) with the projection of the linear regression hyperplane for that specific factor superimposed. To be clearer, the result should be like this: wolfram demonstrations (see the snapshots).
Any help will be greatly appreciated.
Thank you for your time and have a nice week end.
Giorgio.
The points in my comment notwithstanding, here is the canonical way to generate output from a fitted model in R for combinations of predictors. It really isn't clear what the plots you want are showing, but the ones that make sense to me are partial plots, where one variable is varied over its range whilst holding the others at some common value. Here I use the sample mean when holding a variable constant.
First some dummy data, with only two covariates, but this extends to any number of them.
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
                 x1 = 0.2 + (0.4 * y) + rnorm(100),
                 x2 = 2.4 + (2.3 * y) + rnorm(100))
Fit the regression model
mod <- lm(y ~ x1 + x2, data = dat)
Next, some data values to predict at using the model. You could do all variables in a single prediction and then subset the resulting object to plot only the relevant rows. Alternatively, more clearly (though more verbosely), you can deal with each variable separately. Below I create two data frames, one per covariate in the model. In each data frame I generate 100 values over the range of the covariate being varied and repeat the mean value of the other covariate(s).
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length = 100),
                               x2 = rep(mean(x2), 100)))
pdatx2 <- with(dat, data.frame(x1 = rep(mean(x1), 100),
                               x2 = seq(min(x2), max(x2), length = 100)))
In a linear regression with straight lines you really don't need 100 values; the two end points of the range of the covariate would do. However, for models where the fitted function is not linear, you need to predict at more locations.
Next, use the model to predict at these data points
pdatx1 <- transform(pdatx1, yhat = predict(mod, pdatx1))
pdatx2 <- transform(pdatx2, yhat = predict(mod, pdatx2))
Now we are ready to draw the partial plots. First compute a range for the y axis; this is mostly redundant here, but if you are adding confidence intervals you will need to include their values below:
ylim <- range(pdatx1$yhat, pdatx2$yhat, dat$y)
To plot (here putting two figures on the same plot device) we can use the following code
layout(matrix(1:2, ncol = 2))
plot(y ~ x1, data = dat, ylim = ylim)
lines(yhat ~ x1, data = pdatx1, col = "red", lwd = 2)
plot(y ~ x2, data = dat, ylim = ylim)
lines(yhat ~ x2, data = pdatx2, col = "red", lwd = 2)
layout(1)
Which produces two partial plots, one per covariate: the observed data with the fitted partial relationship overlaid in red.
