Pass model formula as argument in R

I need to cross-validate several glmer models on the same data so I've made a function to do this (I'm not interested in preexisting functions for doing this). I want to pass an arbitrary glmer model to my function as the only argument. Sadly, I can't figure out how to do this, and the interwebz won't tell me.
Ideally, I would like to do something like:
model = glmer(y ~ x + (1|z), data = train_folds, family = "binomial")
model2 = glmer(y ~ x2 + (1|z), data = train_folds, family = "binomial")
And then call cross_validation_function(model) and cross_validation_function(model2). The training data within the function is called train_folds.
However, I suspect I need to pass the model formula in a different way, using reformulate.
Here is an example of my function. The project is about predicting autism (ASD) from behavioral features. The data variable is da.
library(pacman)
p_load(tidyverse, stringr, lmerTest, MuMIn, psych, corrgram, ModelMetrics,
       caret, boot)
cross_validation_function <- function(model){
  # creating folds by participant
  participants = unique(da$participant)
  folds <- createFolds(participants, 10)

  cross_val <- sapply(seq_along(folds), function(x) {
    train_folds = filter(da, !(as.numeric(participant) %in% folds[[x]]))
    predict_fold = filter(da, as.numeric(participant) %in% folds[[x]])

    # model to be tested should be passed as an argument here
    train_model <- model

    predict_fold <- predict_fold %>%
      mutate(predictions_perc = predict(train_model, predict_fold, allow.new.levels = T),
             predictions_perc = inv.logit(predictions_perc),
             predictions = ifelse(predictions_perc > 0.5, "ASD", "control"))

    conf_mat <- caret::confusionMatrix(data = predict_fold$predictions,
                                       reference = predict_fold$diagnosis,
                                       positive = "ASD")

    accuracy <- conf_mat$overall[1]
    sensitivity <- conf_mat$byClass[1]
    specificity <- conf_mat$byClass[2]
    fixed_ef <- fixef(train_model)

    output <- c(accuracy, sensitivity, specificity, fixed_ef)
  })

  cross_df <- t(cross_val)
  return(cross_df)
}
Solution developed from the comments: using as.formula, strings can be converted into a formula, which can then be passed as an argument to my function in the following way:
cross_validation_function <- function(model_formula){
  ...
  train_model <- glmer(model_formula, data = train_folds, family = "binomial")
  ...
}
formula <- as.formula("y ~ x + (1|z)")
cross_validation_function(formula)
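For the two models from the question, the calls would then look something like this (the formula strings follow the original glmer calls; the variable names are just illustrative):
model_formula1 <- as.formula("y ~ x + (1|z)")
model_formula2 <- as.formula("y ~ x2 + (1|z)")
result1 <- cross_validation_function(model_formula1)
result2 <- cross_validation_function(model_formula2)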

If your aim is to extract the model formula from a fitted model, then you can use attributes(model)$call[[2]]. You can then use this formula when fitting the model on the CV folds.
mod_formula <- attributes(model)$call[[2]]
train_model = glmer(mod_formula, data = train_data,
                    family = "binomial")
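A minimal sketch of how this slots into the original function, so that the fitted model stays the only argument (it assumes the same da, createFolds and fold-filtering setup as in the question; formula(model) is an equivalent way to pull the formula and is not part of the original answer):
cross_validation_function <- function(model){
  mod_formula <- attributes(model)$call[[2]]   # or equivalently: formula(model)
  participants <- unique(da$participant)
  folds <- createFolds(participants, 10)
  cross_val <- sapply(seq_along(folds), function(x) {
    train_folds <- filter(da, !(as.numeric(participant) %in% folds[[x]]))
    predict_fold <- filter(da, as.numeric(participant) %in% folds[[x]])
    # refit the extracted formula on the fold's training data
    train_model <- glmer(mod_formula, data = train_folds, family = "binomial")
    # ... same prediction and performance code as in the original function ...
  })
  t(cross_val)
}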

Related

Retrain best model on full dataset in R

I have two models to select from and using some criteria I choose one of the two. (The below is just an example, I know it doesn't make much sense)
library(forecast)
set.seed(4)
sample_dat = sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
test_res = sapply(models, function(x) accuracy(predict(x, test, type = "response"), test$dist)[2]) #Getting the RMSE for each model
best_model = models[which.min(test_res)]
How can I retrain the best model using the full dataset (train + test)? I checked the update and update.formula functions but these don't seem to be updating the data part.
update(best_model[[1]], data = rbind(train, test))
You do not want to change the formula, since that is the best model, but rather update the data.
Base R, using your own logic, first creating a list mirroring the models list:
set.seed(4)
sample_dat = sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
# text versions of each call refit on the full cars data, mirroring the models list
model_application = list("lm(dist ~ speed, cars)$call",
                         "glm(dist ~ speed, cars, family = 'poisson')$call")
test_res = sapply(models,
                  function(x){
                    # Store a function to calculate the RMSE: rmse => function
                    rmse <- function(actual_vec, pred_vec){sqrt(mean((pred_vec - actual_vec)**2))}
                    # Getting the RMSE for each model: numeric scalar => .GlobalEnv
                    rmse(test$dist, predict(x, newdata = test, type = "response"))
                  }
)
best_model = models[[which.min(test_res)]]
# evaluating the parsed text returns the chosen model's call (with cars as the data);
# evaluating that call refits the chosen model on the full dataset
applied_model <- eval(eval(parse(text = model_application[[which.min(test_res)]])))
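A quick sanity check that the selected model really was refit on the full cars data (this check is illustrative and not part of the original answer):
nobs(applied_model)    # should equal nrow(cars), i.e. 50
coef(applied_model)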

predict function on the 'grplasso' package

I use the 'grplasso' package with train and test datasets. I find the best lambda (minimum AIC) by fitting the model on the train dataset; this lambda is called 'lambdaopt'.
BestTrainFit <- grplasso(Outcome ~ ., data = traindata, lambda = lambdaopt, model = LogReg(), center = TRUE, standardize = TRUE)
I want to evaluate the model's performance on the test dataset. So, which of the ways below is correct?
1. Fitting the 'grplasso' model again with 'lambdaopt' on the test dataset
BestTestFit <- grplasso(Outcome ~ ., data = testdata, lambda = lambdaopt, model = LogReg(), center = TRUE, standardize = TRUE)
p1 = BestTestFit$fitted
2. Using the 'predict' function from the 'grplasso' package
p2 = predict(BestTrainFit,testdata,type = 'response')

Caret returns different predictions with caret train object than it does with the extracted final model

I prefer to use caret when fitting models because of its relative speed and preprocessing capabilities. However, I'm slightly confused on how it makes predictions. When comparing predictions made directly from the train object and predictions made from the extracted final model, I'm seeing very different numbers. The predictions from the train object appear to be more accurate.
library(caret)
library(ranger)
x1 <- rnorm(100)
x2 <- rbeta(100, 1, 1)
y <- 2*x1 + x2 + 5*x1*x2
data <- data.frame(x1, x2, y)
fitRanger <- train(y ~ x1 + x2, data = data,
                   method = 'ranger',
                   tuneLength = 1,
                   preProcess = c('knnImpute', 'center', 'scale'))
predict.data <- data.frame(x1 = rnorm(10), x2 = rbeta(10, 1, 1))
prediction1 <- predict(fitRanger, newdata = predict.data)
prediction2 <- predict(fitRanger$finalModel, data = predict.data)$prediction
results <- data.frame(prediction1, prediction2)
results
I'm positive it has something to do with how I preprocess the data in the train object, but even when I preprocess the test data and use the Ranger model to make predictions, the values are different:
predict.data.processed <- predict.data %>%
  preProcess(method = c('knnImpute',
                        'center',
                        'scale')) %>% .$data
results3 <- predict(fitRanger$finalModel, data = predict.data.processed)$prediction
results <- cbind(results, results3)
results
I want to extract the predictions from each individual tree in the ranger model, which I can't do in caret. Any thoughts?
In order to get the same predictions from the final model as with caret train you should pre-process the data in the same way. Using your example with set.seed(1):
caret predict:
prediction1 <- predict(fitRanger,
                       newdata = predict.data)
ranger predict on the final model; caret preProcess was applied to predict.data first:
prediction2 <- predict(fitRanger$finalModel,
                       data = predict(fitRanger$preProcess,
                                      predict.data))$prediction
all.equal(prediction1, prediction2)
#output
TRUE
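For the per-tree predictions mentioned at the end of the question, ranger's predict method has a predict.all argument; a short sketch using the same preprocessed data (this part is not from the original answer):
per_tree <- predict(fitRanger$finalModel,
                    data = predict(fitRanger$preProcess, predict.data),
                    predict.all = TRUE)$predictions
dim(per_tree)   # one row per observation, one column per tree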

Function for Logistic Regression Training Set

I am trying to create a function to test a logistic regression model developed on a training set.
For example
train <- filter(y, folds != i)
test <- filter(y, folds == i)
I want to be able to use the formula for different data sets.
For example, if I were to take y to be a response variable such as "low" in the birthwt data set and x to be the explanatory variables, e.g. "age" and "race", how would I implement these arguments into the glm.train formula without having to type the function separately for different data sets?
glm.train <- glm(y ~ x, family = binomial, data = train)
You can use reformulate to create a formula based on strings:
x <- c("age", "race")
y <- "low"
form <- reformulate(x, response = y)
# low ~ age + race
Use this formula for glm:
glm.train <- glm(form, family = binomial, data = train)
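To reuse this across data sets without retyping the call each time, the same idea can be wrapped in a small helper; the function name and arguments below are illustrative, not from the answer:
fit_glm_train <- function(data, response, predictors) {
  # build the formula from strings, then fit the logistic regression
  form <- reformulate(predictors, response = response)
  glm(form, family = binomial, data = data)
}
glm.train <- fit_glm_train(train, "low", c("age", "race"))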

ANOVA after using glm.fit

I would like to perform a likelihood ratio test to determine the power of a model term in a DOE. Till now I have been using the p-value from the glm fit to do this and things have been fine. As I started to use the anova function, I realized that there does not seem to be an anova function designed to accept the input from a glm.fit function, only a glm function. Here is an example of what I would like to do:
X # This is a model matrix from model.matrix
y # These are the y values for the fit
tfit = glm.fit(X, y, family = poisson())
anova(tfit, test = 'LRT')
Typically I would assume that the anova function call would just need to be altered to anova.glm, but that is not the case. How can I get the glm.fit function output to be compatible with an anova function input?
The problem is that glm.fit does not return an object of class glm, but a raw list with all kinds of data about the model. This cannot be fed to anova.glm, since that function expects an object of class glm as produced by the glm function. If you have the raw data available (thus not yet turned into a model matrix), you can apply the glm function to it to produce the desired outcome.
X <- matrix(c(runif(10), rnorm(10)), ncol = 2)
y <- round(runif(10, 1, 5))
X.mm <- model.matrix(y ~ X)
model.fit.1 <- glm.fit(X.mm, y, family = poisson())
class(model.fit.1)
model.fit.2 <- glm(y ~ X, family = "poisson")
class(model.fit.2)
anova(model.fit.2, test = "LRT")
If you can't use the glm function and must use glm.fit, then you can construct the LRT yourself from the glm.fit output. For a start, take the following function:
LRT.glm.fit <- function(glm.fit.mod){
  df.null <- glm.fit.mod$df.null
  df.mod <- glm.fit.mod$df.residual
  dev.null <- glm.fit.mod$null.deviance
  dev.mod <- glm.fit.mod$deviance
  dev.diff <- dev.null - dev.mod
  p.value <- 1 - pchisq(dev.diff, df.null - df.mod)
  output <- c(round(df.null), round(df.mod), dev.null, dev.mod, p.value)
  names(output) <- c("df.null", "df.mod", "dev.null", "dev.mod", "p.value")
  output
}
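Applied to the glm.fit object from the example above, the call would simply be (a usage sketch):
LRT.glm.fit(model.fit.1)   # returns df.null, df.mod, dev.null, dev.mod and p.value as a named vector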
