How exactly do you predict in gam? With reproducible example - r

How do you predict in mgcv::gam when you've fitted a model that might contain random effects?
The other thread on this site with the "exclude" trick does not work for me (https://stats.stackexchange.com/questions/131106/predicting-with-random-effects-in-mgcv-gam)
ya <- rnorm(100, 0, 1)
yb <- rnorm(100,0,1.5)
yc <- rnorm(100, 0, 2)
yd <- rnorm(100, 0, 2.5)
yy <- c(ya,yb,yc,yd) #so, now we've got data from 4 different groups.
xx <- c(rep("a", 100), rep("b",100), rep("c",100),rep("d",100)) #groups
zz <- rnorm(400,0,1) #some other covariate
model <- gam(yy ~ zz + s(xx, bs = "re")) #the model
predictdata <- data.frame( zz = 5 ) #new data
predict(model, newdata = predictdata, exclude = "s(xx)") #prediction
and this produces the error
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'xx')
In addition: Warning messages:
1: In predict.gam(model, newdata = predictdata, exclude = "s(xx)") :
not all required variables have been supplied in newdata!
2: 'newdata' had 1 row but variables found have 400 rows
My mgcv package is the latest.
EDIT:
If you change predictdata to
predictdata <- data.frame(zz = 5, xx = "f")
then it says
Error in predict.gam(model, newdata = predictdata, exclude = "s(xx)") :
f not in original fit

I experimented with your example and it seems that the 'exclude' statement does work, even though you have to specify in newdata values for the random effects that were included in the original dataset used to fit the model. This however, makes me a bit uneasy. Another caveat is that 'exclude' did not seem to work on a model with a variance structure that was estimated separately by group (I tried this with another dataset), i.e., something like s(xx, s="re", by=group). You might want to post or have the question moved to crossvalidated so that other statisticians/analysts can see it an perhaps provide a better answer.
Below is my code. Note that I changed the means for groups a and d, yet the overall mean should be around zero.
ya <- rnorm(100, 1, 1)
yb <- rnorm(100, 0,1.5)
yc <- rnorm(100, 0, 2)
yd <- rnorm(100, -1, 2.5)
yy <- c(ya,yb,yc,yd) #so, now we've got data from 4 different groups.
xx <- c(rep("a", 100), rep("b",100), rep("c",100),rep("d",100)) #groups
zz <- rnorm(400,0,1) #some other covariate
some.data= data.frame(yy,xx,zz)
model <- gam(yy ~ zz + s(xx, bs = "re"),data=some.data) #the model
# the intercept is the overall mean when zz is zero
summary(model)
predictdata <- data.frame(zz = c(0,0,0,0), xx =c("a","b","c","d")) #new data
#excluding random effects. Estimate should be the same for all and should be the intercept
predict(model, newdata = predictdata, exclude = "s(xx)")
#including random effects. Estimates should differ by group with 'a' larger and 'd' smaller
predict(model, newdata = predictdata)

Related

Does caret::train() in r have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models to test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. Biggest issue is that caret doesn't provide all the model fit statistics that I'd like so they need to be derived from the raw output. Based on my exploration there doesn't seem to be a standardized way caret outputs each models fit.
Another post (sorry don't have a link) created this function which pulls from fit$results and fit$bestTune to get pre calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate using raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output not the original dataset due to the models dropping x variables (feature selection) or rows (omitting NAs) based on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
The real question is is there a standardized output of caret::train() for x variables or either y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducability
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c(y1, y2), c(x1,x2,x3), linear_models)

Misuse predict.rq in the package quantreg?

I am using quantreg package to predict new data based on training set. However, I noticed a discrepancy between predict.rq or predict and doing it manually. Here is an example:
The quantile regression setting is
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
The new data set I want to predict is
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
I use predict.rq or predict to predict newdata. Both return the same result:
fit_use_predict <- predict.rq( fit, newdata = as.data.frame(newdata) )
Also I manually do the prediction based on the coefficients matrix:
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
I expect both are numerically identical, but they are not:
diff <- fit_use_predict - fit_use_multiplication
print(diff)
Their difference cannot be negligible.
However, predicting the original data set X, both return the same result, i.e.,
predict(fit, newdata = data.frame(X)) = X %*% coef_mat ## True
Do I miss something when using the function? Thanks!
A more serious problem here, before we get to prediction is that the model is forcing all of the fitted quantile functions through the origin of design space and since the covariates are centered at the origin all of the quantile functions are forced to cross there. Even if the X's all lie in the positive orthant it is quite a strong assumption to say that the distribution of the response is degenerate at the origin.
I think you just have to retain the 'X' name in your data as it was in the training data.
library(quantreg)
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
fit_use_predict <- predict.rq( fit, newdata = data.frame(X=I(newdata)) )
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
diff <- fit_use_predict - fit_use_multiplication
max( abs(diff) )
Output is 0

What exactly is happening in these models when an intercept is removed from a mixed effects model?

I have the following data:
set.seed(3)
library(data.table)
library(lme4)
a <- rep(1:5, times = 20)
b <- rep(c(1,1,1,1,1,2,2,2,2,2), times = 50)
ppt <- rep(101:110, each = 10)
item <- rep(1:10, times = 10)
dv <- rnorm(n = 100)
contrasts(data$a) = contr.sum(4)
data <- data.table(cbind(ppt, item, a, b, dv))
data$ppt <- as.factor(data$ppt)
data$item <- as.factor(data$item)
data$a <- as.factor(data$a)
data$b <- as.factor(data$b)
I would like to get a coefficient for each level of a. u/omsa_d00d and u/dead-serious pointed me to the idea of running a model without an intercept.
If I run this model:
m1 <- lmer(dv ~ a + b -1 +(1|ppt) + (1|item), data = data)
I get coefficients for each level of a.
However if I run this model in which b comes first:
m2 <- lmer(dv ~ b + a -1 +(1|ppt) + (1|item), data = data)
I get coefficients for each level of b, but not a.
What exactly is happening in each case?
Additionally, is running m1 sufficient to get an effect of each level of a compared to the grand mean, while also controlling for b?
Does it matter if I mean centre my predictors first?
What are the different implications of dummy vs. sum coding factor a?

Implementing multinomial-Poisson transformation with multilevel models

I know variations of this question have been asked before but I haven't yet seen an answer on how to implement the multinomial Poisson transformation with multilevel models.
I decided to make a fake dataset and follow the method outlined here, also consulting the notes the poster mentions as well as the Baker paper on MP transformation.
In order to check if I'm doing the coding correctly, I decided to create a binary outcome variable as a first step; because glmer can handle binary response variables, this will let me check I'm correctly recasting the logit regression as multiple Poissons.
The context of this problem is running multilevel regressions with survey data where the outcome variable is response to a question and the possible predictors are demographic variables. As I mentioned above, I wanted to see if I could properly code the binary outcome variable as a Poisson regression before moving on to multi-level outcome variables.
library(dplyr)
library(lme4)
key <- expand.grid(sex = c('Male', 'Female'),
age = c('18-34', '35-64', '45-64'))
set.seed(256)
probs <- runif(nrow(key))
# Make a fake dataset with 1000 responses
n <- 1000
df <- data.frame(sex = sample(c('Male', 'Female'), n, replace = TRUE),
age = sample(c('18-34', '35-64', '45-64'), n, replace = TRUE),
obs = seq_len(n), stringsAsFactors = FALSE)
age <- model.matrix(~ age, data = df)[, -1]
sex <- model.matrix(~ sex, data = df)[, -1]
beta_age <- matrix(c(0, 1), nrow = 2, ncol = 1)
beta_sex <- matrix(1, nrow = 1, ncol = 1)
# Create class probabilities as a function of age and sex
probs <- plogis(
-0.5 +
age %*% beta_age +
sex %*% beta_sex +
rnorm(n)
)
id <- ifelse(probs > 0.5, 1, 0)
df$y1 <- id
df$y2 <- 1 - df$y1
# First run the regular hierarchical logit, just with a varying intercept for age
glm_out <- glmer(y1 ~ (1|age), family = 'binomial', data = df)
summary(glm_out)
#Next, two Poisson regressions
glm_1 <- glmer(y1 ~ (1|obs) + (1|age), data = df, family = 'poisson')
glm_2 <- glmer(y2 ~ (1|obs) + (1|age), data = df, family = 'poisson')
coef(glm_1)$age - coef(glm_2)$age
coef(glm_out)$age
The outputs for the last two lines are:
> coef(glm_1)$age - coef(glm_2)$age
(Intercept)
18-34 0.14718933
35-64 0.03718271
45-64 1.67755129
> coef(glm_out)$age
(Intercept)
18-34 0.13517758
35-64 0.02190587
45-64 1.70852847
These estimates seem close but they are not exactly the same. I'm thinking I've specified an equation wrong with the intercept.

Simulating a mixed linear model and evaluating it with lmerTest in R

I am trying to understand how to use mixed linear models to analyse my data by simulating a model, but I can't reproduce the input parameters. What am I missing?
I want to start simulating a model with a random intercept for each subject. Here is the formula of what I want to simulate and reproduce:
If beta1 (<11) is small I find gamma00 as the intercept in fixed section, but I am completedly unaable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
for(i in 1:numObj){
y[index] <- gamma00 + gamma01*w[j]+ uo[j] + beta1*x[i] + e[index]
index <- index + 1
}
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x +
(1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed up index value: you should be using index rather than i to index x in your data generation loop.
Basically due to the mix-up you were using the first subject's x values for generating data for all the subjects, but using the individual x values in the model.

Resources