Run cforest with controls = cforest_unbiased() using the caret package

I would like to run an unbiased cforest using the caret package. Is this possible?
tc <- trainControl(method = "cv",
                   number = f,
                   number = f,
                   index = indexList,
                   savePredictions = TRUE,
                   classProbs = TRUE,
                   summaryFunction = twoClassSummary)
createCfGrid <- function(len, data) {
  g = createGrid("cforest", len, data)
  g = expand.grid(.controls = cforest_unbiased(mtry = 5, ntree = 1000))
  return(g)
}
set.seed(1)
(cfMatFit <- train(as.factor(f1win) ~ .,
                   data = df,
                   method = "cforest",
                   metric = "ROC",
                   trControl = tc,
                   tuneGrid = createCfGrid))
The error is:
Error in as.character.default(<S4 object of class "ForestControl">) :
  no method for coercing this S4 class to a vector
This is because cforest_control() cannot be coerced into a data frame. The function does work if I use:
...
g = expand.grid(.mtry = 5)
...
However, if I want to change ntree, this has no effect:
...
g = expand.grid(.mtry = 5, .ntree = 1000)
...
Unlike with randomForest, this does not raise an error; the extra column is silently ignored.

The grid should be a simple data frame with a column called .mtry. The code
g = createGrid("cforest", len, data)
will generate that for you. If you want to specify ntree, just pass a controls object as another argument to train but leave out mtry:
mod <- train(Species ~ ., data = iris,
             method = "cforest",
             controls = cforest_unbiased(ntree = 10))
caret takes care of changing mtry for you.
Max
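
A sketch, not from Max's answer, combining the two mechanisms: mtry tuned through the grid, ntree fixed through a controls object. In current caret versions the grid column is named mtry without the leading dot, and createGrid is no longer needed; the grid values below are illustrative.
library(caret)
library(party)
set.seed(1)
mod <- train(Species ~ ., data = iris,
             method = "cforest",
             tuneGrid = data.frame(mtry = 2:4),       # tuned by caret
             controls = cforest_unbiased(ntree = 50))  # fixed by the controls object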

Related

How to implement knn based on weights

I would like to implement the weighted knn algorithm, but I don't know how to do it. Everywhere I see that kknn can be used; I suppose it can also be done with knn. The train() function in caret has a "weights" option, but I can't find the solution. Any suggestions?
I use the following code in R:
library(caret)
library(corrplot)
glass <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data",
                  col.names = c("", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"))
str(glass)
head(glass)
glass_1 <- glass[,-7]
glass_2 <- glass_1[,-7]
head(glass_2)
glass <- glass_2
standard.features <- scale(glass[,2:8])
data <- cbind(standard.features, glass[9])
anyNA(data)
head(data)
corrplot(cor(data))
data$Type <- factor(data$Type)
inTraining <- createDataPartition(data$Type, p = .7, list = FALSE, times = 1)
training <- data[ inTraining,]
testing <- data[-inTraining,]
prop.table(table(training$Type))
prop.table(table(testing$Type))
dim(training); dim(testing);
summary(data)
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5
                           ## repeated ten times
                           # repeats = 5
                           )
#k_value <- expand.grid(kmax = 3, distance = 2, kernel = "optimal")
k_value <- expand.grid(k = 3)
set.seed(825)
knn_Fit <- train(Type ~ ., data = training, weights = ????,
                 method = "knn", tuneGrid = k_value,
                 trControl = fitControl)
## This last option is actually one
## for gbm() that passes through
# verbose = FALSE)
knn_Fit
knn_Fit$finalModel
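
A sketch, not an answer from the original thread: caret's "knn" method (class::knn) has no neighbour weighting, and train()'s weights argument supplies per-observation case weights, not neighbour weights. One way to get distance-weighted knn is method = "kknn" (assuming the kknn package is installed); its tuning parameters match the commented-out grid in the question (kmax, distance, kernel).
library(kknn)
# kernel = "optimal" weights neighbours by distance rather than equally
kknn_grid <- expand.grid(kmax = 3, distance = 2, kernel = "optimal")
set.seed(825)
kknn_fit <- train(Type ~ ., data = training,
                  method = "kknn", tuneGrid = kknn_grid,
                  trControl = fitControl)
kknn_fit$finalModel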

Select tuneGrid depending on the model in caret R

I am trying to apply ML to the iris dataset, using the "knn" and "rpart" algorithms. This is my code:
library(tidyverse)
library(caret)
dataset <- iris
tt_index <- createDataPartition(dataset$Sepal.Length, times = 1, p = 0.9, list = FALSE)
train_set <- dataset[tt_index, ]
test_set <- dataset[-tt_index, ]
models <- c("knn","rpart")
fits <- lapply(models, function(model){
  print(model)
  train(Species ~ .,
        data = train_set,
        tuneGrid = case_when(model == "knn" ~ data.frame(k = seq(3, 50, 1)),
                             model == "rpart" ~ data.frame(cp = seq(0, 0.1, len = 50))),
        method = model)
})
I want to set the tuneGrid parameter depending on the model inside lapply, but I receive this error:
Error in `[.data.frame`(value[[1]], rep(NA_integer_, m)) :
undefined columns selected
Any help will be greatly appreciated.
We could use if/else:
library(caret)
out <- lapply(models, function(model)
  train(Species ~ ., data = train_set,
        tuneGrid = if (model == "knn") data.frame(k = seq(3, 50, 1)) else
                     data.frame(cp = seq(0, 0.1, len = 50)),
        method = model))
According to ?case_when, the return value is:
A vector of length 1 or n, matching the length of the logical input or output vectors, with the type (and attributes) of the first RHS. Inconsistent lengths or types will generate an error.
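
An equivalent sketch using switch(), not from the original answer; like if/else it is evaluated once per (scalar) model name, so it can return a whole data frame:
out <- lapply(models, function(model)
  train(Species ~ ., data = train_set,
        method = model,
        tuneGrid = switch(model,
                          knn   = data.frame(k = seq(3, 50, 1)),
                          rpart = data.frame(cp = seq(0, 0.1, len = 50)))))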

How does setting preProcess argument in train function in Caret work?

I am trying to predict the times table by training a neural network. However, I couldn't really work out how the preProcess argument of caret's train function works.
In the docs, it says:
The preProcess class can be used for many operations on predictors, including centering and scaling.
When we set preProcess like below,
tt.cv <- train(product ~ .,
               data = tt.train,
               method = 'neuralnet',
               tuneGrid = tune.grid,
               trControl = train.control,
               linear.output = TRUE,
               algorithm = 'backprop',
               preProcess = 'range',
               learningrate = 0.01)
Does it mean that the train function preprocesses (normalizes) the training data passed, in this case tt.train?
After training is done, when we are trying to predict, do we pass normalized inputs to the predict function, or are the inputs normalized inside the function because we set the preProcess parameter?
# Do we do
predict(tt.cv, tt.test)
# or
predict(tt.cv, tt.normalized.test)
From the quote above, it seems that preProcess does not normalize the outputs during training. How do we go about normalizing the outputs? Or do we just normalize the training data beforehand, like below, and then pass it to the train function?
preProc <- preProcess(tt, method = 'range')
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
The whole code:
library(caret)
library(neuralnet)
# Create the dataset
tt = data.frame(multiplier = rep(1:10, times = 10), multiplicand = rep(1:10, each = 10))
tt = cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))
# Splitting
indexes = createDataPartition(tt$product,
                              times = 1,
                              p = 0.7,
                              list = FALSE)
tt.train = tt[indexes,]
tt.test = tt[-indexes,]
# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
# Train
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              savePredictions = TRUE)
tune.grid <- expand.grid(layer1 = 8,
                         layer2 = 0,
                         layer3 = 0)
tt.cv <- train(product ~ .,
               data = tt.train,
               method = 'neuralnet',
               tuneGrid = tune.grid,
               trControl = train.control,
               algorithm = 'backprop',
               learningrate = 0.01,
               stepmax = 100000,
               preProcess = c('center', 'scale'),
               lifesign = 'minimal',
               threshold = 0.01)
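
A sketch addressing the two questions above, based on caret's documented behaviour rather than the original thread: train() estimates the preProcess transformation from the training predictors only, stores it in the returned object, and predict.train re-applies it to new data, so the raw test set is passed to predict. preProcess never transforms the outcome, so a scaled outcome has to be prepared manually and the predictions back-transformed.
# predict.train applies the stored preProcess to tt.test automatically
preds <- predict(tt.cv, newdata = tt.test)

# preProcess only touches predictors; scale the outcome by hand if needed
mu   <- mean(tt.train$product)
sdev <- sd(tt.train$product)
tt.train.scaled <- tt.train
tt.train.scaled$product <- (tt.train$product - mu) / sdev
# ... fit on tt.train.scaled, then invert on the predictions:
# predict(fit, tt.test) * sdev + mu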

Save Gradient Boosting Machine values obtained with Bootstrap

I am fitting a gradient boosting machine to identify the importance of the variables in the model, and I am resampling to see how the importance of each variable behaves.
But I can't correctly save each variable's name together with its importance computed in each bootstrap replicate.
I'm doing this with a function that is called by the boot() command from the boot package.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)
df <- make_ames()
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  return(summary(gbm.fit)[, 2])
}
results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
results_GBM$t0
I expect to save the bootstrap results with their variable names but I can only save the importance of variables without their names.
With summary.gbm, the default is to order the variables by importance; you need to set order = FALSE, and also suppress the plot with plotit = FALSE. The returned variable importances then follow the order of the variables in the fit:
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  # use gbmfit as the name because gbm.fit is also a function in gbm
  gbmfit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
  names(o) <- gbmfit$var.names
  return(o)
}
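
A possible follow-up, not from the original answer: because the statistic now returns a named vector, boot keeps those names on results_GBM$t0, and they can be copied onto the replicate matrix:
results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
# one row per bootstrap replicate, one column per variable
colnames(results_GBM$t) <- names(results_GBM$t0)
head(results_GBM$t)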

Train function in caret returns error message

I am using caret's train() function to find an optimal cp value for a CART decision tree, using the F1 score as the metric via a custom summary function. The train() function returns an error I cannot understand. Perhaps the problem lies in the way I defined the reproducible example?
> library(data.table)
> library(ROSE)
> data(hacide)
> train <- hacide.train
> test <- hacide.test
> numFolds = trainControl(method = "cv", number = 10)
> cpGrid = expand.grid(.cp = seq(0.01, 0.5, 0.01))
> f1 <- function(data, lev = NULL, model = NULL) {
+ f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
+ c(F1 = f1_val)
+ }
> set.seed(12)
> train(cls ~ ., data = train,
+ method = "rpart",
+ tuneLength = 5,
+ metric = "F1",
+ trControl = trainControl(summaryFunction = f1,
+ classProbs = TRUE))
Error in train.default(x, y, weights = w, ...) :
At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
> levels(train$cls)
[1] "0" "1"
> class(train$cls)
[1] "factor"
You can try this:
levels(train$cls) <- make.names(levels(train$cls))
Then run your model; this should fix your problem. Unfortunately, your example is not fully reproducible, since the F1_Score function definition is missing from your question (it appears to come from the MLmetrics package). See if this works.
The following works for me:
levels(train$cls) <- make.names(levels(train$cls))
set.seed(12)
train(cls ~ ., data = train, method = "rpart", tuneLength = 5,
      metric = "ROC",
      trControl = trainControl(summaryFunction = twoClassSummary, classProbs = TRUE))
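
A sketch, assuming F1_Score comes from the MLmetrics package: once the factor levels are valid R names, the custom F1 summary from the question should also run as originally intended:
library(MLmetrics)
levels(train$cls) <- make.names(levels(train$cls))
set.seed(12)
train(cls ~ ., data = train, method = "rpart", tuneLength = 5,
      metric = "F1",
      trControl = trainControl(summaryFunction = f1, classProbs = TRUE))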
