How do I save models as a vector with a loop? - r

So I have this assignment where I have to create 3 different models in R. I can build them individually without a problem. However, I want to take it a step further and create a function that trains all of them with a for loop. (I know I could write a function that trains the 3 models explicitly each time. I am not looking for other solutions to the problem; I want to do it this way, or in a similar fashion, because now I have 3 models, but imagine if I wanted to train 20!)
I tried creating a list to store all three models, but I keep getting warnings.
library(caret)
library(readr)
library(rstudioapi)
library(e1071)
library(dplyr)
library(rpart)
TrainingFunction <- function(method,formula,data,tune) {
fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
if(method == "rf") {Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)}
else if (method == "knn"){
preObj <- preProcess(data[, c(13,14,15)], method=c("center", "scale"))
data <- predict(preObj, data)
Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)
}
else if (method == "svm"){Model <- svm(formula, data = data,cost=1000 , gamma = 0.001)}
Model
}
This is the training function I created, and it works, but now I want to train all three at once!
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
These are the warnings:
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I print Models, the output is this:
[[1]]
[1] "rf"
[[2]]
[1] "knn"
[[3]]
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)

Consider switch to avoid the many if and else especially if extending to 20 models. Then use lapply to build a list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  # Each switch arm must be separated by a comma
  Model <- switch(method,
    "rf" = train(formula, data = data, method = method,
                 trControl = fitcontrol, tuneLength = tune),
    "knn" = {
      # Center and scale the selected columns before fitting knn
      preObj <- preProcess(data[, c(13, 14, 15)],
                           method = c("center", "scale"))
      data <- predict(preObj, data)
      train(formula, data = data, method = method,
            trControl = fitcontrol, tuneLength = tune)
    },
    "svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
  )
  # Return the fitted model visibly
  Model
}
methods <- c("rf", "knn", "svm")
Model_list <- lapply(methods, function(m)
  TrainingFunction(m, Volume ~ ., List$trainingSet, 5))

I think the problem comes from this line:
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
If you want to assign your model to the i-th place of the list, you should do it with a double bracket, like this:
{Models[[i]] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
Another alternative would be to use lapply instead of an explicit loop, so you avoid that problem altogether:
train_from_method <- function(method) {TrainingFunction(method,Volume~.,List$trainingSet,5)}
Models <- lapply(methods, train_from_method)
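The difference between single- and double-bracket assignment can be seen in a small base-R sketch. It does not use caret: `fit_one` is a hypothetical stand-in for TrainingFunction that fits a linear model on the built-in mtcars data, and the predictor names are chosen only for illustration.

```r
# Hypothetical stand-in for TrainingFunction: one linear model per predictor
fit_one <- function(predictor) {
  lm(reformulate(predictor, response = "mpg"), data = mtcars)
}

predictors <- c("wt", "hp", "disp")
models <- vector(mode = "list", length = length(predictors))

for (i in seq_along(predictors)) {
  # [[i]] stores the whole fitted object in slot i.
  # [i] would treat the model as a list of components, keep only the
  # first one, and warn that the replacement length does not match.
  models[[i]] <- fit_one(predictors[i])
}
names(models) <- predictors

# Every slot now holds a complete model object
sapply(models, class)
```

Naming the list elements after the methods (here, the predictors) makes it easier to pull out a specific model later, for example `models[["wt"]]`.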

Related

Cross-validation with OneR using the R language

I'm trying to cross-validate with the OneR algorithm and I don't quite know how to do it. With the example code I get the error "Error in x[0, , drop = FALSE] : incorrect number of dimensions"
glass <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data",
col.names=c("","RI","Na","Mg","Al","Si","K","Ca","Ba","Fe","Type"))
str(glass)
head(glass)
standard.features <- scale(glass[,2:10])
data <- cbind(standard.features,glass[11])
data$Type<-factor(data$Type)
anyNA(data)
inTraining <- createDataPartition(data$Type, p = .7, list = FALSE, times =1 )
training <- data[ inTraining,]
testing <- data[-inTraining,]
set.seed(12345)
fitControl <- trainControl(## 5-fold CV
method = "cv",
number = 5
)
model <- OneR(Type~.,data= training)
oneRFit1 <- train(model,
trControl = fitControl)
It is fairly easy to write your own loop to carry out cross-validation. However, it looks like you want to use the caret package to manage it. If so, just use the method argument inside caret's train function to specify that you want to use OneR:
oneRFit1 <- train(Type~.,
data=training,
method="OneR" ,
trControl = fitControl)
str(iris)
head(iris)
set.seed(123)
inTraining <- createDataPartition(iris$Species, p = .7, list = FALSE, times =1 )
training <- iris[ inTraining,]
testing <- iris[-inTraining,]
set.seed(123)
train.control <- trainControl(method = "cv", number = 2)
# Train the model
oneRFit <- train(Species ~., data = training, method = "OneR",
trControl = train.control)

R caret extractPrediction with random forest model: Error: $ operator is invalid for atomic vectors

I want to extract the predictions for new, unseen data using the function caret::extractPrediction with a random forest model, but I cannot figure out why my code throws the error Error: $ operator is invalid for atomic vectors. How should the input parameters be structured to use this function?
Here is my reproducible code:
library(caret)
dat <- as.data.frame(ChickWeight)
# create column set
dat$set <- rep("train", nrow(dat))
# split into train and validation set
set.seed(1)
dat[sample(nrow(dat), 50), which(colnames(dat) == "set")] <- "validation"
# predictors and response
all_preds <- dat[which(dat$set == "train"), which(names(dat) %in% c("Time", "Diet"))]
response <- dat[which(dat$set == "train"), which(names(dat) == "weight")]
# set train control parameters
contr <- caret::trainControl(method="repeatedcv", number=3, repeats=5)
# recursive feature elimination caret
set.seed(1)
model <- caret::train(x = all_preds,
y = response,
method ="rf",
ntree = 250,
metric = "RMSE",
trControl = contr)
# validation set
vali <- dat[which(dat$set == "validation"), ]
# not working
caret::extractPrediction(models = model, testX = vali[,-c(3,5,1)], testY = vali[,1])
caret::extractPrediction(models = model, testX = vali, testY = vali)
# works without problems
caret::predict.train(model, newdata = vali)
I found a solution by looking at the documentation of extractPrediction. Basically, the argument models doesn't want a single model instance, but a list of models. So I just inserted list(my_rf = model) and not just model.
caret::extractPrediction(models = list(my_rf = model), testX = vali[,-c(3,5,1)], testY = vali[,1])

Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, : variable lengths differ (found for 'Sepal.Length')

I've tried to look at similar questions but can't figure out my problem.
I was already able to complete my analysis with random forest (using caret), tuning parameters separately. Now I'm trying to create a function that will perform my analysis all at once.
I created a function with two inputs, the dataset, and variable to be classified.
For now I'm using the iris dataset for simplicity.
RF <- function(data, classvariable) {
# Best mtry
trControl <- trainControl(method = "cv", number = 10,
search = "grid")
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 3))
RF_mtry <- train(classvariable ~.,
data = dataset,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
ntree = 100)
print(RF_mtry)
mtry = 0
for (i in 1:nrow(RF_mtry$results)) {
if (RF_mtry$results[i,2] > mtry) mtry <-
RF_mtry$results[i,2]
}
trial_mtry <- c(1:3)
best_mtry <- trial_mtry[i]
best_mtry
}
Once I run the function
RF(data = iris, classvariable = Species)
I get the error
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
I tried running the code without putting it in a function, writing iris directly instead of dataset and Species instead of classvariable, and it works.
Previously I was getting the error
Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, :
variable lengths differ (found for 'Sepal.Length')
Anybody have an idea why it does not work?
Thank you very much.
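One likely cause, offered as a hedged sketch and not a confirmed fix: inside the function, `classvariable ~ .` is taken literally, so the model code looks for a column actually named classvariable, and `dataset` is not the name of the function's argument (which is `data`). Assuming the class column is passed as a string, the formula can be built with base R's reformulate. The helper name `make_formula` is hypothetical.

```r
# Hypothetical helper: build `<response> ~ .` from a column name given as a string
make_formula <- function(classvariable) {
  reformulate(".", response = classvariable)
}

f <- make_formula("Species")
print(f)

# The formula now resolves against the data frame that is passed in
mf <- model.frame(f, data = iris)
names(mf)
```

Under this assumption, the call inside RF would become `train(make_formula(classvariable), data = data, ...)`, and the function would be invoked as `RF(data = iris, classvariable = "Species")`.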

Metric Accuracy not applicable for regression models

I am trying to evaluate a machine learning model in R. Training the model does not work.
# # Logistic regression multiclass
for (i in 1:30) {
# split data into training/test
trainPhyIndex <- createDataPartition(subs_phy$Methane, p=10/17,list = FALSE)
trainingPhy <- subs_phy[trainPhyIndex,]
testingPhy <- subs_phy[-trainPhyIndex,]
# Pre-process predictor values
trainXphy <- trainingPhy[,names(trainingPhy)!= "Methane"]
preProcValuesPhy <- preProcess(x= trainXphy,method = c("center","scale"))
# using boot to avoid over-fitting
fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
number = 10,
repeats = 4,
savePredictions="final",
classProbs = TRUE
)
fit_glmnet_phy <- train (Methane~.,
trainingPhy,
method = "glmnet",
tuneGrid = expand.grid(
.alpha =0.1,
.lambda = 0.00023),
metric = "Accuracy",
trControl = fitControlPhyGLMNET)
pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
# Get the confusion matrix to see accuracy value
u <- union(pred_glmnet_phy,testingPhy$Methane)
t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
accu_glmnet_phy <- confusionMatrix(t)
# accu_glmnet_phy<-confusionMatrix(pred_glmnet_phy,testingPhy$Methane)
glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
The program always stops at the fit_glmnet_phy <- train (Methane~., ..
command and shows
Metric Accuracy not applicable for regression models
I have no idea what causes this error.
I also attached the type of Methane.
Try normalizing the input columns and converting the outcome column to a factor. caret treats a numeric outcome as a regression problem, for which Accuracy is not defined. This helped me resolve a similar issue.
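A minimal sketch of the factor conversion, using made-up data in place of subs_phy (the values and the x1 column are assumptions for illustration). Because the question sets classProbs = TRUE, the factor levels also need to be valid R names, which make.names can provide.

```r
# Hypothetical stand-in for subs_phy: Methane is stored as numbers
subs_phy <- data.frame(
  Methane = c(0, 1, 2, 1, 0),
  x1      = c(0.2, 1.5, 3.1, 1.4, 0.3)
)

# A numeric outcome is treated as regression; a factor signals classification.
# make.names() turns 0/1/2 into syntactically valid level names.
subs_phy$Methane <- factor(make.names(subs_phy$Methane))

is.factor(subs_phy$Methane)
levels(subs_phy$Methane)
```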

GAM method without resampling in caret produces stop error

I wrote a function within lapply to fit a GAM (with splines) for each element in a vector of response variables within a data frame. I opted to use caret to fit the models instead of directly using mgcv or the gam package because I would like to eventually split my data into a train/test set for validation and use various resampling techniques. For now, I simply have the trainControl method set to 'none' like so:
# Set resampling method
# tc <- trainControl(method = "boot", number = 100)
# tc <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
tc <- trainControl(method = "none")
fm <- lapply(group, function(x) {
printFormula <- paste(x, "~", inf.factors)
inputFormula <- as.formula(printFormula)
# Partition input data for model training and testing
# dpart <- createDataPartition(mdata[,x], times = 1, p = 0.7, list = FALSE)
# train <- mdata[ data.partition, ]
# test <- mdata[ -data.partition, ]
cat("Fitting:", printFormula, "\n")
# gam(inputFormula, family = binomial(link = "logit"), data = mdata)
train(inputFormula, family = binomial(link = "logit"), data = mdata, method = "gam",
trControl = tc)
})
When I execute this code, I receive the following error:
Error in train.default(x, y, weights = w, ...) :
Only one model should be specified in tuneGrid with no resampling
If I re-run the code in debugging mode, I can find where caret stops the training process:
if (trControl$method == "none" && nrow(tuneGrid) != 1)
stop("Only one model should be specified in tuneGrid with no resampling")
Clearly the train function fails because of the second condition, but when I look up the tuning parameters for a GAM (with splines) there is only an option for feature selection (not interested, I want to keep all the predictors in the model) and the method. Consequently, I do not include a tuneGrid data frame when I call train. Is this the reason why the model is failing in this way? What parameter would I provide and what would the tuneGrid look like?
I should add that the model is trained successfully when I use bootstrapping or k-fold CV, however these resampling methods take much longer to calculate and I do not need to use them yet.
Any help on this issue would be appreciated!
For that model, the default tuning grid covers two values of the select parameter:
> getModelInfo("gam", regex = FALSE)[[1]]$grid
function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
out <- expand.grid(select = c(TRUE, FALSE), method = "GCV.Cp")
} else {
out <- data.frame(select = sample(c(TRUE, FALSE), size = len, replace = TRUE),
method = sample(c("GCV.Cp", "ML"), size = len, replace = TRUE))
}
out[!duplicated(out),]
}
You should use something like tuneGrid = data.frame(select = FALSE, method = "GCV.Cp") so that only a single model is evaluated, as the error message says.
