Caret train function for multiple data frames as function - r

There has been a similar question to mine 6+ years ago and it hasn't been solved (R -- Can I apply the train function in caret to a list of data frames?)
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering whether there is a way to wrap the model training function train() of the caret package so it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
  model <- train(predictor ~ ., data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I can tell, the R formula interface cannot deal with a variable passed in this way.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean and would definitely save work.

By writing predictor_iris <- "Species", you are saving a string object in predictor_iris. Thus, when you run lda_ex, I suspect you run into an error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Remember to state the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
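If you really want to keep passing the predictor as a string, one option (not part of the answer above, just a sketch) is to build the formula inside the function with base R's reformulate(); the function name lda_ex_str is made up here:
library(caret)
lda_ex_str <- function(data, predictor){
  form <- reformulate(".", response = predictor)  # e.g. "Species" becomes Species ~ .
  train(form, data = data,
        method = "lda",
        trControl = trainControl(method = "none"),
        preProc = c("center", "scale"))
}
iris_res2 <- lda_ex_str(data = iris, predictor = "Species")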

Related

for loop in train(caret) to select different predictors in a lm model

I'm just a beginner, so I hope you can help with a problem with the KNN model (via train in caret) in R.
I tried this:
models.list = as.list(vector(length = ncol(FIFA21_db)))
for(i in 1:ncol(mtcars)) {
  models.list[[i]] <- train(x = mtcars[,i], y = mtcars[,1], method = "lm")
}
This causes the error "Please use column names for x". Do you know how I can use the column names instead of bare observations in a for loop? My goal is to use different variables for an lm regression.
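One possible fix (a sketch, not an accepted answer): keep x as a one-column data frame so caret can see a column name, e.g. with drop = FALSE. The loop below assumes the goal is one single-predictor lm per column of mtcars, with column 1 (mpg) as the outcome:
library(caret)
models.list <- vector("list", ncol(mtcars) - 1)
for (i in 2:ncol(mtcars)) {
  # drop = FALSE keeps mtcars[, i] as a named one-column data frame
  models.list[[i - 1]] <- train(x = mtcars[, i, drop = FALSE],
                                y = mtcars[, 1],
                                method = "lm")
}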

R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
                    control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?
The predict function looks for a variable called speed, but when you subset with the $ sign the result is a bare vector that no longer carries that name.
So this variant of prediction works:
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
or keep the original name as is:
predict(cars.gb, newdata = cars_new['speed'])
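And since cars_new was built as cars + rnorm(nrow(cars)), it still carries both original column names, so passing the whole data frame also avoids the error:
predict(cars.gb, newdata = cars_new)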

Custom model (R6 class) with caret giving binding error

I have created a custom model using an R6 class. Due to the proprietary nature of the work, I can't share the code but can share the anonymized structure. Basically, you would instantiate the model as follows:
mod = my_model$new(formula = form, data = data, hyper1 = hyper1, hyper2 = hyper2)
Arguments:
formula: Standard R formula
data: Data Frame containing X and y values
hyper1 and hyper2: hyper-parameters of this model
Now, I am trying to integrate this with the caret package and having issues during the training process. This is the fit function that I am using
fitFunc <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  # Custom algorithm takes a data frame as input, so convert 'x' and 'y' to a data frame
  # https://stats.stackexchange.com/questions/89171/help-requested-with-using-custom-model-in-caret-package
  data = as.data.frame(x)
  data$.outcome = y
  data = as.data.frame(data)
  # Define formula for model
  form = formula(.outcome ~ .)
  mod = my_model$new(formula = form, data = data, hyper1 = param$hyper1, hyper2 = param$hyper2, ...)
  return(mod)
}
This is the code snippet for integrating the custom model with caret and running it.
library(caret)
set.seed(998)
inTraining <- createDataPartition(mtcars$mpg, p = .75, list = FALSE)
training <- mtcars[ inTraining,]
testing <- mtcars[-inTraining,]
fitControl <- trainControl(method = "cv", number = 3)
set.seed(825)
mdl_builder = train(mpg ~ ., data = training,
                    method = custom_model_list,
                    tuneLength = 8,
                    trControl = fitControl)
However, this leads to the following error messages (just a small snippet, it actually fails on every fold)
model fit failed for Fold2: hyper1=1, hyper2=4 Error in modelFit$xNames <- colnames(x) :
cannot add bindings to a locked environment
I think this comes from the fact that the caret code internally tries to assign xNames to the R6 class object, but the R6 class does not allow this. I don't understand how to fix this (if it is possible at all). Any help would be appreciated.
Thanks!!
I figured it out right after typing this question (+ a little more research)
The key was to set lock_objects = FALSE in the R6 class so that new attributes can be added from outside.
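For anyone hitting the same error, here is a minimal sketch of where lock_objects = FALSE goes (the class body is made up, since the real model is proprietary):
library(R6)
my_model <- R6Class("my_model",
  public = list(
    formula = NULL,
    data = NULL,
    initialize = function(formula, data, hyper1, hyper2) {
      self$formula <- formula
      self$data <- data
      # ... fit the proprietary model here ...
    }
  ),
  lock_objects = FALSE  # lets caret add new bindings such as $xNames after creation
)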

Error converting rxGlm to GLM

I'm having a problem converting rxGlm models to regular glm models. Every time I try to convert my models I get the same error:
Error in qr.lm(object) : lm object does not have a proper 'qr' component.
Rank zero or should not have used lm(.., qr=FALSE).
Here's a simple example:
cols <- colnames(iris)
vars <- cols[!cols %in% "Sepal.Length"]
form1 <- as.formula(paste("Sepal.Length ~", paste(vars, collapse = "+")))
rx_version <- rxGlm(formula = form1,
                    data = iris,
                    family = gaussian(link = 'log'),
                    computeAIC = TRUE)
# here is the equivalent model with base R
R_version <- glm(formula = form1,
                 data = iris,
                 family = gaussian(link = 'log'))
summary(as.glm(rx_version)) # this always gives the above error
I can't seem to find this "qr" component (I'm assuming it is related to matrix decomposition) to specify in the rxGlm call.
Anyone else dealt with this?
rxGlm objects don't have a qr component, and converting to a glm object won't create one. This is intentional, as computing the QR decomposition of the model matrix requires the full dataset to be in memory which would defeat the purpose of using the rx* functions.
as.glm is really meant more for supporting model import/export via PMML. Most of the things you'd want to do can be done with the rxGlm object directly, without converting. E.g., rxGlm computes the coefficient standard errors as part of the fit, without requiring a qr component afterwards.
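For illustration, a small sketch of working with the rxGlm fit directly (this assumes RevoScaleR is loaded; exact field names may differ by version):
summary(rx_version)      # coefficient table with standard errors, no qr component needed
rx_version$coefficients  # estimates stored on the fit object (field name assumed)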

R object is not a matrix

I am new to R and trying to save my SVM model; I have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix", which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
  cat("Running SVM = ", timesRun, " result = ")
  trainSet = as.data.frame(data[,1:(ncol(data)-1)])
  trainClasses = as.factor(data[,ncol(data)])
  model = svm(trainSet, trainClasses, type="C-classification",
              kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
              cachesize = 10000, cross = 10)
  accAll = model$accuracies
  cat(mean(accAll), "/", sd(accAll), "\n")
  results10foldAll = rbind(results10foldAll, c(mean(accAll), sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm() call expects data to be a matrix (it is trainSet that you are assigning to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix means all columns share one data type, all numeric or all categorical/factors. If that is true for your data, as.matrix() may be fine.
In practice, more often than not, people want model.matrix() or sparse.model.matrix() (from the Matrix package), which gives dummy columns for categorical variables while keeping a single column for each numerical variable. But the result is still a matrix.
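As a small illustration of the model.matrix() route (iris is used here as a stand-in for the asker's data): the factor column is expanded into dummy columns, numeric columns stay single, and the result is a plain numeric matrix that svm() accepts.
X <- model.matrix(Sepal.Length ~ . - 1, data = iris)  # -1 drops the intercept column
y <- iris$Sepal.Length
head(X)  # numeric matrix with Species expanded into dummy columns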
