Retrain best model on full dataset in R

I have two models to select from, and I choose one of the two using some criterion. (The code below is just an example; I know it doesn't make much sense.)
library(forecast)
set.seed(4)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
test_res = sapply(models, function(x) accuracy(predict(x, test, type = "response"), test$dist)[2]) #Getting the RMSE for each model
best_model = models[which.min(test_res)]
How can I retrain the best model using the full dataset (train + test)? I checked the update and update.formula functions but these don't seem to be updating the data part.

update(best_model[[1]], data = rbind(train, test))
You do not want to change the formula, since that is the best model; you only want to update the data.
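A minimal sketch of that idea, using the objects from the question: update() re-evaluates the stored call with the new data argument, so the winning formula and family are kept and only the data changes.
full_data   <- rbind(train, test)            # here this is just the full `cars` data reassembled
final_model <- update(best_model[[1]], data = full_data)
nobs(final_model)                            # now fitted on all 50 rows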

In base R, using your own logic, first create a list mirroring the models list:
set.seed(4)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
model_application = list("lm(dist ~ speed, cars)",
                         "glm(dist ~ speed, cars, family = 'poisson')")
test_res = sapply(models,
                  function(x){
                    # Store a function to calculate the RMSE: rmse => function
                    rmse <- function(actual_vec, pred_vec){sqrt(mean((pred_vec - actual_vec)**2))}
                    # Getting the RMSE for each model: numeric scalar => .GlobalEnv
                    rmse(test$dist, predict(x, newdata = test, type = "response"))
                  }
)
best_model = models[[which.min(test_res)]]
# Refit the winning specification on the full data by evaluating its stored call string:
applied_model <- eval(parse(text = model_application[[which.min(test_res)]]))
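A quick check of the result (a small addition, not part of the original answer): the refit object now carries the full `cars` data set in its call.
applied_model$call        # the winning model, refit on the full cars data
summary(applied_model)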

Related

How do I run a logit/probit model with only an intercept in R (aod)?

I am doing this for econometrics purposes and want to test the significance of all coefficients; it would just be better for my calculations. I have tried regressing the variable on itself and it returned some result with only an intercept, but I am not sure if this is correct.
myprobit0 <- glm(target ~ target,
data = data_numeric, family = binomial(link = "probit") )
Could you please suggest a proper way to only estimate the model with a constant?
vecs = rep(1, length(target))  # give the length of your data as the second argument
myprobit0 <- glm(target ~ vecs,
                 data = data_numeric, family = binomial(link = "probit"))
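A simpler equivalent is worth noting (a hedged suggestion, reusing the names from the question): a formula of the form target ~ 1 fits an intercept-only model directly, so no column of ones has to be created by hand.
myprobit0 <- glm(target ~ 1,
                 data = data_numeric, family = binomial(link = "probit"))
summary(myprobit0)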

predict function in the 'grplasso' package

I use the 'grplasso' package with train and test datasets. I find the best lambda (minimum AIC) by fitting the model on the train dataset. This lambda is named 'lambdaopt'.
BestTrainFit <- grplasso(Outcome ~ ., data = traindata, lambda = lambdaopt, model = LogReg(), center = TRUE, standardize = TRUE)
I want to evaluate model performance on the test dataset. So, which of the ways below is correct?
1. Fitting the 'grplasso' model again with 'lambdaopt' on the test dataset
BestTestFit <- grplasso(Outcome ~ ., data = testdata, lambda = lambdaopt, model = LogReg(), center = TRUE, standardize = TRUE)
p1 = BestTestFit$fitted
2. Using the 'predict' function from the 'grplasso' package
p2 = predict(BestTrainFit, testdata, type = 'response')
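For reference, a hedged sketch of how the predictions from the training-data fit (option 2) could be scored on the test set; it assumes Outcome is coded 0/1 and uses an arbitrary 0.5 cut-off.
p2 <- predict(BestTrainFit, testdata, type = 'response')
# simple misclassification rate on the held-out data
test_error <- mean((p2 > 0.5) != testdata$Outcome)
test_error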

Pass model formula as argument in R

I need to cross-validate several glmer models on the same data so I've made a function to do this (I'm not interested in preexisting functions for doing this). I want to pass an arbitrary glmer model to my function as the only argument. Sadly, I can't figure out how to do this, and the interwebz won't tell me.
Ideally, I would like to do something like:
model = glmer(y ~ x + (1|z), data = train_folds, family = "binomial")
model2 = glmer(y ~ x2 + (1|z), data = train_folds, family = "binomial")
And then call cross_validation_function(model) and cross_validation_function(model2). The training data within the function is called train_folds.
However, I suspect I need to pass the model formula in a different way, perhaps using reformulate.
Here is an example of my function. The project is about predicting autism (ASD) from behavioral features. The data variable is da.
library(pacman)
p_load(tidyverse, stringr, lmerTest, MuMIn, psych, corrgram, ModelMetrics,
caret, boot)
cross_validation_function <- function(model){
  # creating folds
  participants = unique(da$participant)
  folds <- createFolds(participants, 10)
  cross_val <- sapply(seq_along(folds), function(x) {
    train_folds = filter(da, !(as.numeric(participant) %in% folds[[x]]))
    predict_fold = filter(da, as.numeric(participant) %in% folds[[x]])
    # model to be tested should be passed as an argument here
    train_model <- model
    predict_fold <- predict_fold %>%
      mutate(predictions_perc = predict(train_model, predict_fold, allow.new.levels = T),
             predictions_perc = inv.logit(predictions_perc),
             predictions = ifelse(predictions_perc > 0.5, "ASD", "control"))
    conf_mat <- caret::confusionMatrix(data = predict_fold$predictions, reference = predict_fold$diagnosis, positive = "ASD")
    accuracy <- conf_mat$overall[1]
    sensitivity <- conf_mat$byClass[1]
    specificity <- conf_mat$byClass[2]
    fixed_ef <- fixef(train_model)
    output <- c(accuracy, sensitivity, specificity, fixed_ef)
  })
  cross_df <- t(cross_val)
  return(cross_df)
}
Solution developed from the comments: using as.formula, strings can be converted into a formula, which can then be passed as an argument to my function in the following way:
cross_validation_function <- function(model_formula){
  ...
  train_model <- glmer(model_formula, data = train_folds, family = "binomial")
  ...}
formula <- as.formula("y ~ x + (1|z)")
cross_validation_function(formula)
If your aim is to extract the model formula from a fitted model, then you can use attributes(model)$call[[2]]. You can then use this formula when fitting the model on the CV folds.
mod_formula <- attributes(model)$call[[2]]
train_model = glmer(mod_formula, data = train_data,
                    family = "binomial")

How to stack machine learning models in R

I am new to machine learning and R.
I know that there is an R package called caretEnsemble, which can conveniently stack models in R. However, this package seems to have some problems when dealing with multi-class classification tasks.
For now, I wrote some code to try to stack the models manually, and here is the example I worked on:
library(caret)
set.seed(123)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3 / 4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
set.seed(62433)
modelFitRF <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm",verbose=F)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")
predRF <- predict(modelFitRF,newdata=testing)
predGBM <- predict(modelFitGBM, newdata = testing)
prefLDA <- predict(modelFitLDA, newdata = testing)
confusionMatrix(predRF, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
confusionMatrix(predGBM, testing$diagnosis)$overall[1]
#Accuracy
#0.7926829
confusionMatrix(prefLDA, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
Now I've got three models: modelFitRF, modelFitGBM and modelFitLDA, and three prediction vectors corresponding to these three models, based on the test set.
Then I will create a data frame to contain these prediction vectors and the original dependent variable from the test set:
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
And then I just used this data frame as a new training set to create a stacked model:
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
#Accuracy
#0.804878
Considering that stacking models should usually improve prediction accuracy, I'd like to believe this might be a correct way to stack the models. However, I also have doubts, because the predDF used here is created from the predictions of the three models on the test set.
I am not sure whether I should use the results from the test set and then apply them back to the same test set to get the final predictions.
(I am referring to this block below:)
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
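For what it's worth, here is a hedged sketch (not from the original post) of the variant that keeps a separate validation split for the meta-model, so the test set is only used once for the final accuracy estimate; the split proportions and object names are arbitrary choices:
library(caret)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)

set.seed(123)
inBuild    <- createDataPartition(adData$diagnosis, p = 3 / 4, list = FALSE)
buildData  <- adData[inBuild, ]
testing    <- adData[-inBuild, ]
inTrain    <- createDataPartition(buildData$diagnosis, p = 3 / 4, list = FALSE)
training   <- buildData[inTrain, ]
validation <- buildData[-inTrain, ]

set.seed(62433)
modelFitRF  <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm", verbose = FALSE)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")

# level-one predictions on the *validation* split train the stacking model
valDF <- data.frame(predRF  = predict(modelFitRF,  validation),
                    predGBM = predict(modelFitGBM, validation),
                    predLDA = predict(modelFitLDA, validation),
                    diagnosis = validation$diagnosis)
modelStack <- train(diagnosis ~ ., data = valDF, method = "rf")

# the test set is only touched once, for the final estimate
testDF <- data.frame(predRF  = predict(modelFitRF,  testing),
                     predGBM = predict(modelFitGBM, testing),
                     predLDA = predict(modelFitLDA, testing),
                     diagnosis = testing$diagnosis)
confusionMatrix(predict(modelStack, testDF), testing$diagnosis)$overall[1]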

Bagging in R: How to use train with the bag function

In R, I am trying to use the bag function with the train function. I start by using train and rpart for a classification tree model on the simple iris data set. Now I want to create a bag of 10 such trees with the bag function. The documentation says that the aggregate parameter must be a function that chooses a value from all bagged models, so I created one called agg, which chooses the string of greatest frequency. However, the bag function gives the following error:
Error in fitter(btSamples[[iter]], x = x, y = y, ctrl = bagControl, v = vars, :
task 1 failed - "attempt to apply non-function"
Here is my complete code:
# Use bagging to create a bagged classification tree from 10 classification trees created with rpart.
library(caret)  # createDataPartition, bag, bagControl, train
library(plyr)   # count(), used in the aggregate function below
data(iris)
# Create training and testing data sets:
inTrain = createDataPartition(y=iris$Species, p=0.7, list=F)
train = iris[inTrain,]
test = iris[-inTrain,]
# Create regressor and outcome datasets for bag function:
regressors = train[,-5]
species = train[,5]
# Create aggregate function:
agg = function(x, type) {
  y = count(x)
  y = y[order(y$freq, decreasing=T),]
  as.character(y$x[1])
}
# Create bagged trees with bag function:
treebag = bag(regressors, species, B=10,
              bagControl = bagControl(fit = train(data=train, species ~ ., method="rpart"),
                                      predict = predict,
                                      aggregate = agg
              )
)
This gives the error message stated above. I don't understand why it rejects the agg function.
From ?bag():
When using bag with train, classification models should use type = "prob" inside of the predict function so that predict.train(object, newdata, type = "prob") will work.
So I guess you might want to try:
bagControl = bagControl(fit = train(data=train, species ~ .,
                                    method="rpart", type="prob"),
                        predict = predict,
                        aggregate = agg
)
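Separately, a hedged sketch based on my reading of ?bag (an assumption, not part of the answer above): bagControl() expects fit, predict and aggregate to be functions, so passing the already-evaluated result of train(...) is what triggers "attempt to apply non-function". Something along these lines, returning class probabilities from each sub-model and averaging them in the aggregation step, should be closer to what bag() expects:
library(caret)
bag_ctrl <- bagControl(
  # fit must be a function of (x, y, ...), not an already-fitted model
  fit = function(x, y, ...) caret::train(x, y, method = "rpart"),
  # per the help page, return class probabilities from each sub-model
  predict = function(object, x) predict(object, newdata = x, type = "prob"),
  # x arrives as a list of probability matrices, one per sub-model:
  # average them and pick the most probable class
  aggregate = function(x, type = "class") {
    probs <- Reduce(`+`, lapply(x, as.matrix)) / length(x)
    factor(colnames(probs)[apply(probs, 1, which.max)], levels = colnames(probs))
  }
)
treebag <- bag(regressors, species, B = 10, bagControl = bag_ctrl)
predict(treebag, test[, -5])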
