Custom model (R6 class) with caret giving binding error - r

I have created a custom model using an R6 class. Because the work is proprietary, I can't share the code, but I can share the anonymized structure. You instantiate the model as follows:
mod = my_model$new(formula = form, data = data, hyper1 = hyper1, hyper2 = hyper2)
Arguments:
formula: Standard R formula
data: Data Frame containing X and y values
hyper1 and hyper2: hyper-parameters of this model
Now, I am trying to integrate this with the caret package and having issues during the training process. This is the fit function that I am using
fitFunc <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
# Custom algorithm takes dataframe as input so need to convert 'x' and 'y' to a dataframe
# https://stats.stackexchange.com/questions/89171/help-requested-with-using-custom-model-in-caret-package
data = as.data.frame(x)
data$.outcome = y
data = as.data.frame(data)
# Define formula for model
form = formula(.outcome ~ .)
mod = my_model$new(formula = form, data = data, hyper1 = param$hyper1, hyper2 = param$hyper2, ...)
return(mod)
}
This is the code snippet for integrating the custom model with caret and running it.
library(caret)
set.seed(998)
inTraining <- createDataPartition(mtcars$mpg, p = .75, list = FALSE)
training <- mtcars[ inTraining,]
testing <- mtcars[-inTraining,]
fitControl <- trainControl(method = "cv", number = 3)
set.seed(825)
mdl_builder = train(mpg ~ ., data = training,
method = custom_model_list,
tuneLength = 8,
trControl = fitControl)
However, this leads to the following error messages (just a small snippet, it actually fails on every fold)
model fit failed for Fold2: hyper1=1, hyper2=4 Error in modelFit$xNames <- colnames(x) :
cannot add bindings to a locked environment
I think this happens because caret internally tries to assign xNames to the R6 object, and the R6 class does not allow that. I don't understand how to fix this (if it is possible at all). Any help would be appreciated.
Thanks!!

I figured it out right after typing this question (plus a little more research).
The key was to set lock_objects = FALSE in the R6 class definition so that new attributes can be added from outside.
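A minimal sketch of that fix, assuming the R6 package is installed. The class names LockedModel and UnlockedModel and the field hyper1 are hypothetical stand-ins, not the original proprietary model:

```r
library(R6)

# Hypothetical stand-in class with the default lock_objects = TRUE.
LockedModel <- R6Class("LockedModel",
  public = list(
    hyper1 = NULL,
    initialize = function(hyper1 = 1) {
      self$hyper1 <- hyper1
    }
  )
)

# Same class, but new fields may be added to instances from outside.
UnlockedModel <- R6Class("UnlockedModel",
  lock_objects = FALSE,
  public = list(
    hyper1 = NULL,
    initialize = function(hyper1 = 1) {
      self$hyper1 <- hyper1
    }
  )
)

# Adding a new field to a locked object fails, as in the caret error.
locked <- LockedModel$new()
locked_result <- tryCatch({
  locked$xNames <- c("cyl", "disp")
  "ok"
}, error = function(e) "error")

# The unlocked object accepts the same assignment.
unlocked <- UnlockedModel$new()
unlocked$xNames <- c("cyl", "disp")
```

With the unlocked variant, the assignment that caret performs (modelFit$xNames <- colnames(x)) should go through, since the instance environment is no longer locked.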

Related

Caret train function for multiple data frames as function

There was a similar question to mine more than 6 years ago, and it hasn't been solved (R -- Can I apply the train function in caret to a list of data frames?).
This is why I am bringing up this topic again.
I'm writing my own functions for a big R project at the moment, and I'm wondering whether I can wrap the model training function train() of the caret package so that it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I have tried, an R formula cannot take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean, and it would save work for sure.
By writing predictor_iris <- "Species", you store a character string in predictor_iris. When you then run lda_ex, the formula inside train() refers to the object predictor, which holds that single string rather than the Species column, so R tries to model a length-one string against the covariates and fails.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
model <- train(formula, data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Remember to load the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
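If you would rather keep passing the response name as a string, one option is to build the formula from that string with base R. This is a sketch only: it uses lm() on mtcars instead of caret's train() to stay self-contained, and the variable names are illustrative.

```r
# Build a formula from a response name stored as a string (base R only).
response <- "mpg"

# Two equivalent ways to construct `mpg ~ .` from the string.
form_a <- reformulate(".", response = response)
form_b <- as.formula(paste(response, "~ ."))

# The resulting formula object can be passed on to a modelling function.
fit <- lm(form_a, data = mtcars)
```

The same form_a could be handed to train() in place of a hand-written formula.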

R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?
The predict function looks for a variable called speed, but when you extract the column with $ you get a bare vector that no longer carries that name.
So this variant of the prediction works:
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
Or keep the original column name by subsetting with single brackets:
predict(cars.gb, newdata = cars_new['speed'])
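The difference between the two subsetting styles can be checked in base R without fitting any model. A small sketch using the built-in cars data:

```r
# `$` extracts the column as a bare numeric vector; the column name is lost.
speed_vector <- cars$speed

# Single brackets with a name keep a one-column data frame, name included.
speed_frame <- cars["speed"]

# Wrapping the vector in data.frame() restores the name explicitly.
speed_rebuilt <- data.frame(speed = cars$speed)
```

Only the last two objects carry a column called speed, which is what predict() needs to find in newdata.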

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem, the printcp's show the same results and the predictions from both on the training and a hold-out set are the same.
library(rpart)
library(caret)
library(dplyr) # provides slice(), used below
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 0,
cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 10,
cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
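A minimal sketch of that pruning step, using the kyphosis data that ships with rpart rather than the abalone data. Picking the cp with the lowest cross-validated error is one common choice and an assumption here, not the only rule:

```r
library(rpart)

set.seed(42)

# Grow a deliberately large tree; xval = 10 fills the xerror column of cptable.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class",
             control = rpart.control(xval = 10, cp = 0.001))

# Pick the complexity parameter with the lowest cross-validated error.
cp_table <- fit$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the full tree back to the subtree for that cp.
pruned <- prune(fit, cp = best_cp)
```

The object returned by rpart() is always the full tree; only the pruned object reflects the cross-validation results.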

rpart Complexity Parameter values

rpart parameters can be found using getModelInfo
getModelInfo("rpart")[[1]]$grid
function(x, y, len = NULL, search = "grid"){
dat <- if(is.data.frame(x)) x else as.data.frame(x)
dat$.outcome <- y
initialFit <- rpart(.outcome ~ .,
data = dat,
control = rpart.control(cp = 0))$cptable
initialFit <- initialFit[order(-initialFit[,"CP"]), , drop = FALSE]
if(search == "grid") {
if(nrow(initialFit) < len) {
tuneSeq <- data.frame(cp = seq(min(initialFit[, "CP"]),
max(initialFit[, "CP"]),
length = len))
} else tuneSeq <- data.frame(cp = initialFit[1:len,"CP"])
colnames(tuneSeq) <- "cp"
} else {
tuneSeq <- data.frame(cp = unique(sample(initialFit[, "CP"], size = len, replace = TRUE)))
}
tuneSeq
}
The only parameter is
cp = seq(min(initialFit[, "CP"]), max(initialFit[, "CP"]),length = len)
But how can I get initialFit and len?
Searching elsewhere, I found that cp usually takes 10 values from 0.18 to 0.01, but I still couldn't find out where those values come from.
If you're unsure about appropriate values for a parameter, you can make caret choose for you and use default values. Here is an example that works end-to-end without explicitly specifying cp:
library(tidyverse)
library(caret)
library(forcats)
# Take mtcars data for example
df <- mtcars %>%
# Which cars are automatic, which ones are manual?
# In mtcars, am is coded 0 = automatic, 1 = manual
mutate(am = as.factor(am),
am = fct_recode(am, 'manual' = '1', 'automatic' = '0'))
set.seed(123456)
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# Run rpart
# Tuning grid is left unspecified, so caret uses the default
tree1 <- train(am ~ .,
df,
method = 'rpart',
tuneLength = 20,
metric = 'ROC',
trControl = fitControl)
Alternatively, if you want to specify cp explicitly, do so using the tuning grid (tuneLength is dropped here because it is ignored when tuneGrid is supplied):
tuneGrid <- expand.grid(cp = seq(0, 0.05, 0.005))
tree2 <- train(am ~ .,
df,
method = 'rpart',
metric = 'ROC',
trControl = fitControl,
tuneGrid = tuneGrid)
A question on why you should select which values for cp is probably better posted on CrossValidated.
Update:
To answer your follow-on question about the default values and the values I chose in my example, I recommend going back to the primary source of the modelling function. caret is a great package for convenience, but all it does is make lots of algorithms more accessible through a shared syntax. If you have a technical question about rpart, consult the rpart package manual.
As mentioned above, this type of question is better placed on CrossValidated, where the focus is on maths, stats, and machine learning.
However, to give you a tldr here:
The choice of tuning grid parameters is always going to be arbitrary to some extent. The objective is to find the value that produces the best results for your specific problem, which in turn depends on your data, your algorithm, and your evaluation metric. A common rule of thumb is to start with a wide range, identify the area with a likely maximum, and then use finer intervals around that region. In your case it is relatively easy, as you only have one parameter to optimise over. Just try a couple of values and see what happens. You can plot the fitted train object (plot(tree1)) to see how the resampled performance changes as a function of the complexity parameter cp. Eventually you will start developing a feel for what might work.
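As for where the default cp values come from, the grid function printed in the question can be replayed by hand. The sketch below mirrors its "grid" branch on mtcars using only rpart; it assumes that len corresponds to the tuneLength passed to train(), and the variable names are illustrative:

```r
library(rpart)

# Example data: predict the transmission type as a factor.
dat <- mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Fit an unpruned tree (cp = 0) and take its cptable, sorted by decreasing CP.
initial_fit <- rpart(am ~ ., data = dat,
                     control = rpart.control(cp = 0))$cptable
initial_fit <- initial_fit[order(-initial_fit[, "CP"]), , drop = FALSE]

# Assumed to correspond to tuneLength in train().
len <- 3

# If the table has fewer rows than len, interpolate; otherwise take the top rows.
cp_candidates <- if (nrow(initial_fit) < len) {
  seq(min(initial_fit[, "CP"]), max(initial_fit[, "CP"]), length.out = len)
} else {
  initial_fit[1:len, "CP"]
}
```

So the candidate values are not fixed constants: they depend on the cptable of an initial tree fitted to your own data.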

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but svm expects data to be a matrix in the svm.model call (where you assign trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix object requires all columns to have the same data type, either all numeric or all categorical/factor. If this is true for your data, as.matrix() may be fine.
In practice, more often than not people want model.matrix() or sparse.model.matrix() (from package Matrix), which produce dummy columns for categorical variables and a single column for each numerical variable. The result is still a matrix.
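A small base R sketch of what model.matrix() produces for mixed column types. The data frame and its column names are made up for illustration:

```r
# Hypothetical toy data: one numeric column and one factor column.
toy <- data.frame(
  x = c(1.5, 2.0, 3.5),
  g = factor(c("a", "b", "c"))
)

# model.matrix() expands the factor into dummy columns and keeps x as is.
# With the default treatment contrasts the columns are:
# (Intercept), x, gb, gc
mm <- model.matrix(~ ., data = toy)
```

The result is an all-numeric matrix, unlike as.matrix(toy), which would coerce every column to character because of the factor.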

Resources