How to conduct catboost grid search using GPU in R? - r

I'm setting up a grid search using the catboost package in R. Following the catboost documentation (https://catboost.ai/docs/), the grid search for hyperparameter tuning can be conducted using the 3 separate commands in R,
fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
grid <- expand.grid(depth = c(7,8,9,10), learning_rate = c(0.1,0.2,0.3,0.4), iterations = c(10,100,1000))
report <- train(df.scale, as.factor(make.names(as.matrix(tier1))), method = catboost.caret, logging_level = 'Verbose', preProc = NULL, tuneGrid = grid, trControl = fit_control)
searching across different values for depth, learning rate, and the number of iterations. These commands seem well enough, it's just I can't figure out where to input the option for the task_type = "GPU". Would appreciate any help on how to specify using the GPU for finding the optimal parameters using R.

It can be done the following way:
fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
grid <- expand.grid(depth = c(7,8,9,10), learning_rate = c(0.1,0.2,0.3,0.4), iterations = c(10,100,1000))
report <- train(df.scale, as.factor(make.names(as.matrix(tier1))), method = catboost.caret, logging_level = 'Verbose', preProc = NULL, tuneGrid = grid, trControl = fit_control,
task_type = "GPU")
This works due to ellipsis mechanics. All arguments that are unknown to caret.train itself are eventually passed to catboost.caret$fit and taken as training parameters for catboost. The exact place in catboost code where it happens is here:
...
catboost.caret$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
param <- c(param, list(...)) # all ellipsis args are taken to param
if (is.null(param$loss_function)) {
...
If you pass an unknown parameter this way, catboost will trigger an error:
report <- train(x, as.factor(make.names(y)),
method = catboost.caret,
logging_level = 'Verbose', preProc = NULL,
tuneGrid = grid, trControl = fit_control, what_is_this = "GPU")
> warnings()
Warning messages:
1: model fit failed for Fold1: depth=4, learning_rate=0.1, l2_leaf_reg=0.001, rsm=1, border_count=64, iterations=100 Error in catboost.train(pool, test_pool, param) :
catboost/private/libs/options/plain_options_helper.cpp:501: Unknown option {what_is_this} with value "GPU"

It looks like you are using the caret package to perform the training. In this case, it looks like the caret package does not pass any additional arguments to the catboost.train function so it may not support the GPU functionality. You can see from the code in caret for this method that the ... argument is not passed to the catboost.train function.
#' Fit model based on input data
#'
#' #param x, y: the current data used to fit the model
#' #param wts: optional instance weights (not applicable for this particular model)
#' #param param: the current tuning parameter values
#' #param lev: the class levels of the outcome (or NULL in regression)
#' #param last: a logical for whether the current fit is the final fit
#' #param weights: weights
#' #param classProbs: a logical for whether class probabilities should be computed
#'
#' #noRd
catboost.caret$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
param <- c(param, list(...))
if (is.null(param$loss_function)) {
param$loss_function <- "RMSE"
if (is.factor(y)) {
param$loss_function <- "Logloss"
if (length(lev) > 2) {
param$loss_function <- "MultiClass"
}
y <- as.double(y) - 1
}
}
test_pool <- NULL
if (!is.null(param$test_pool)) {
test_pool <- param$test_pool
if (class(test_pool) != "catboost.Pool")
stop("Expected catboost.Pool, got: ", class(test_pool))
param <- within(param, rm(test_pool))
}
pool <- catboost.from_data_frame(x, y, weight = wts)
model <- catboost.train(pool, test_pool, param)
model$lev <- lev
return(model)
}
Depending on your level of proficiency in R and caret, you can add your own model to caret by basically copying the model in the caret github location and modify it to accept the GPU argument which should go into the parameter list per their documentation

Related

Error: some required components are missing: prob?

I followed this guideline for creating my own caret model Creating Your Own Model. There it states that
If a regression model is being used or if the classification model
does not create class probabilities a value of NULL can be used here
instead of a function
and so I do that
# Define the model cFBasic
cFBasic <- list(type = "Regression",
library = c("lubridate", "stringr"),
loop = NULL)
...
cFBasic$prob <- NULL
cFBasic$sort <- NULL
However, when I attempt testing the model the following error is produced:
control <- trainControl(method = "cv",
number = 10,
p = .9,
allowParallel = TRUE)
fit <- train(x = calib_set,
y = calib_set$y,
method = cFBasic,
trControl = control)
Error: some required components are missing: prob
How can I fix that? other than adding the function prob to generate a fake pro data frame to make caret happy.
By typing cFBasic$prob <- NULL, you are not actually adding a new item to your list.
Look at this:
cFBasic <- list(prob = NULL)
cFBasic
#> $prob
#> NULL
cFBasic$prob <- NULL
cFBasic
#> named list()
When you assign NULL to an object of a list, you delete that object. If you want to add a NULL object called prob and one NULL object called sort to a list you should type this way:
# Define the model cFBasic
cFBasic <- list(type = "Regression",
library = c("lubridate", "stringr"),
loop = NULL)
...
cFBasic <- c(cFBasic, list(prob = NULL))
cFBasic <- c(cFBasic, list(sort = NULL))
Have a try.

error message lsmeans for beta mixed regression model with glmmTMB

I am analyzing the ratio of (biomass of one part of a plant community) vs. (total plant community biomass) across different treatments in time (i.e. repeated measures) in R. Hence, it seems natural to use beta regression with a mixed component (available with the glmmTMB package) in order to account for repeated measures.
My problem is about computing post hoc comparisons across my treatments with the function lsmeans from the lsmeans package. glmmTMB objects are not handled by the lsmeans function so Ben Bolker on recommended to add the following code before loading the packages {glmmTMB} and {lsmeans}:
recover.data.glmmTMB <- function(object, ...) {
fcall <- getCall(object)
recover.data(fcall,delete.response(terms(object)),
attr(model.frame(object),"na.action"), ...)}
lsm.basis.glmmTMB <- function (object, trms, xlev, grid, vcov.,
mode = "asymptotic", component="cond", ...) {
if (mode != "asymptotic") stop("only asymptotic mode is available")
if (component != "cond") stop("only tested for conditional component")
if (missing(vcov.))
V <- as.matrix(vcov(object)[[component]])
else V <- as.matrix(.my.vcov(object, vcov.))
dfargs = misc = list()
if (mode == "asymptotic") {
dffun = function(k, dfargs) NA
}
## use this? misc = .std.link.labels(family(object), misc)
contrasts = attr(model.matrix(object), "contrasts")
m = model.frame(trms, grid, na.action = na.pass, xlev = xlev)
X = model.matrix(trms, m, contrasts.arg = contrasts)
bhat = fixef(object)[[component]]
if (length(bhat) < ncol(X)) {
kept = match(names(bhat), dimnames(X)[[2]])
bhat = NA * X[1, ]
bhat[kept] = fixef(object)[[component]]
modmat = model.matrix(trms, model.frame(object), contrasts.arg = contrasts)
nbasis = estimability::nonest.basis(modmat)
}
else nbasis = estimability::all.estble
list(X = X, bhat = bhat, nbasis = nbasis, V = V, dffun = dffun,
dfargs = dfargs, misc = misc)
}
Here is my code and data:
trt=c(rep("T5",13),rep("T4",13),
rep("T3",13),rep("T1",13),rep("T2",13),rep("T1",13),
rep("T2",13),rep("T3",13),rep("T5",13),rep("T4",13))
year=rep(2005:2017,10)
plot=rep(LETTERS[1:10],each=13)
ratio=c(0.0046856237844411,0.00100861922394448,0.032516291436091,0.0136507743972955,0.0940240065096705,0.0141337428305094,0.00746709315018945,0.437009092691189,0.0708021091805216,0.0327952505849285,0.0192685194751524,0.0914696394299481,0.00281889216102303,0.0111928453399615,0.00188119596836005,NA,0.000874623692966351,0.0181192859074754,0.0176635391424644,0.00922358069727823,0.0525280029990213,0.0975006760149882,0.124726170684951,0.0187132600944396,0.00672592365451266,0.106399234215126,0.0401776844073239,0.00015382736648373,0.000293356424756535,0.000923659501144292,0.000897412901472504,0.00315930225856196,0.0636501228611642,0.0129422445492391,0.0143526630252398,0.0136775931834926,0.00159292971508751,0.0000322313783211749,0.00125352390811532,0.0000288862579879126,0.00590690336494395,0.000417043974238875,0.0000695808216192379,0.001301299696752,0.000209355138230326,0.000153151660178623,0.0000646279598274632,0.000596704590065324,9.52943306579156E-06,0.000113476446629278,0.00825405312309618,0.0001025984082064,0.000887617767039489,0.00273668802742924,0.00469409165130462,0.00312377000134233,0.0015579322817235,0.0582615988387306,0.00146933878743163,0.0405139497779372,0.259097955479886,0.00783997376383192,0.110638003652979,0.00454029511918275,0.00728290246595241,0.00104674197030363,0.00550563937846687,0.000121380392484705,0.000831904606687671,0.00475778829159394,0.000402799910756391,0.00259524300745195,0.000210249875492504,0.00550104485802363,0.000272849546913495,0.0025389089622392,0.00129370075116459,0.00132810234020792,0.00523285954007915,0.00506230599388357,0.00774104695265855,0.00098348404576587,0.174079173227248,0.0153486840317039,0.351820365452281,0.00347674458928481,0.147309225196026,0.0418825705903947,0.00591271021100856,0.0207139520537443,0.0563647804012055,0.000560012457272534,0.00191564842393647,0.01493480083524,0.00353400674061077,0.00771828473058641,0.000202009136938048,0.112695841130448,0.00761492172670762,0.038797330459115,0.217367765362878,0.0680958660605668,0.0100870294641921,0.00493875324236991,0.00136539944656238,0.00264262100866192,0.0847732305020654,0.00460985241335143,0.235802638543116,0.16336020383325,0.225776236687456,0.0204568107372349,0.0455390585228863,0.130969863489582,0.00679523322812889,0.0172325334280024,0.00299970176999806,0.00179347656925317,0.00721658257996989,0.00822443690003783,0.00913096724026346,0.0105920192618379,0.0158013204589482,0.00388803567197835,0.00366268607026078,0.0545418725650633,0.00761485067129418,0.00867583194858734,0.0188232707241144,0.018652666214789)
dat=data.frame(trt,year,plot,ratio)
require(glmmTMB)
require(lsmeans)
mod=glmmTMB(ratio~trt*scale(year)+(1|plot),family=list(family="beta",link="logit"),data=dat)
summary(mod)
ls=lsmeans(mod,pairwise~trt)`
Finally, I get the following error message that I've never encountered and on which I could find no information:
In model.matrix.default(trms, m, contrasts.arg = contrasts) :
variable 'plot' is absent, its contrast will be ignored
Could anyone shine their light? Thanks!
This is not an error message, it's a (harmless) warning message. It occurs because the hacked-up method I wrote doesn't exclude factor variables that are only used in the random effects. You should worry more about this output:
NOTE: Results may be misleading due to involvement in interactions
which is warning you that you are evaluating main effects in a model that contains interactions; you have to think about this carefully to make sure you're doing it right.

error tuning custom algorithm with caret r

I want to tune two parameters of my custom algorithm with caret. Un parameter (lambda) is numeric and the other parameter (prior) is character. This parameter can take two values "known" or "unknown". I've tuned the algorithm with just the lambda parameter. It's okay. But when I add the character parameter (prior) gives me the following error:
1: In eval(expr, envir, enclos) : model fit failed for Resample01:
lambda=1, prior=unknown Error in mdp(Class = y, data = x, lambda =
param$lambda, prior = param$prior, : object 'assignment' not found
the error must be related with the way to specify the character parameter (prior). Here is my code:
my_mod$parameters <- data.frame(
parameter = c("lambda","prior"),
class = c("numeric", "character"),
label = c("sample_length", "prior_type"))
## The grid Element
my_mod$grid <- function(x, y, len = NULL){expand.grid(lambda=1:2,prior=c("unknown", "known"))}
mygrid<-expand.grid(lambda=1:2,prior=c('unknown','known'))
## The fit Element
my_mod$fit <- function(x, y, wts, param, lev, last, classProbs, ...){
mdp(Class=y,data=x,lambda=param$lambda,prior=param$prior,info.pred ="yes")
}
## The predict Element
mdpPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict.mdp(modelFit, newdata)
my_mod$predict <- mdpPred
fitControl <- trainControl(method = "cv",number = 5,repeats = 5)
train(x=data, y = factor(Class),method = my_mod,trControl = fitControl, tuneGrid = mygrid)
That is because you must specify as.character(param$prior) in the fit function.

How to create model object in bagging?

I tried creating a function for Ensemble of Ensemble modelling:
library(foreach)
library(randomForest)
set.seed(10)
Y<-round(runif(1000))
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-c(1:1000)*runif(1000,min=0,max=2)
x3<-c(1:1000)*runif(1000,min=0,max=2)
all_data<-data.frame(Y,x1,x2,x3)
bagging = function(dataFile, length_divisor = 4, iterations = 100)
{
fit = list()
predictions = foreach(m = 1 : iterations, .combine = cbind) %do%
{
dataFile$Y = as.factor(dataFile$Y)
rf_fit = randomForest(Y ~ ., data = dataFile, ntree = 100)
fit[[m]] = rf_fit
rf_fit$votes[,2]
}
rowMeans(predictions)
return(list(formula = as.formula("Y ~ ."), trees = fit, ntree = 100, class = dataFile$Y, votes = predictions))
}
final_model = bagging(all_data)
predict(final_model, TestData) # It says predict doesn't support final_model object
# Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list"
It says -
Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list".
I need the above function bagging to return an aggregated model object so that I can predict on new data set.
Your bagging function just returns an arbitrary list. Predict looks to the class of the first parameter to know "the right thing" to do. I assume you want to predict from the randomForest objects stored inside the list? You can loop over your list with Map(). For example
Map(function(x) predict(x, TestData), final_model$trees)
(untested since you didn't seem to provide TestData)

How to incorporate logLoss in caret

I'm attempting to incorporate logLoss as the performance measure used when tuning randomForest (other classifiers) by way of caret (instead of the default options of Accuracy or Kappa).
The first R script executes without error using defaults. However, I get:
Error in { : task 1 failed - "unused argument (model = method)"
when using the second script.
The function logLoss(predict(rfModel,test[,-c(1,95)],type="prob"),test[,95]) works by way of leveraging a separately trained randomForest model.
The dataframe has 100+ columns and 10,000+ rows. All elements are numeric outside of the 9-level categorical "target" at col=95. A row id is located in col=1.
Unfortunately, I'm not correctly grasping the guidance provided by http://topepo.github.io/caret/training.html, nor having much luck via google searches.
Your help are greatly appreciated.
Working R script:
fitControl = trainControl(method = "repeatedcv",number = 10,repeats = 10)
rfGrid = expand.grid(mtry=c(1,9))
rfFit = train(target ~ ., data = train[,-1],method = "rf",trControl = fitControl,verbose = FALSE,tuneGrid = rfGrid)
Not working R script:
logLoss = function(data,lev=NULL,method=NULL) {
lLoss = 0
epp = 10^-15
for (i in 1:nrow(data)) {
index = as.numeric(lev[i])
p = max(min(data[i,index],1-epp),epp)
lLoss = lLoss - log(p)
}
lLoss = lLoss/nrow(data)
names(lLoss) = c("logLoss")
lLoss
}
fitControl = trainControl(method = "repeatedcv",number = 10,repeats = 10,summaryFunction = logLoss)
rfGrid = expand.grid(mtry=c(1,9))
rfFit = train(target ~ ., data = trainBal[,-1],method = "rf",trControl = fitControl,verbose = FALSE,tuneGrid = rfGrid)
I think you should set summaryFunction=mnLogLoss in trainControl and metric="logLoss" in train (I found it here). Like this:
# load libraries
library(caret)
# load the dataset
data(iris)
# prepare resampling method
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(7)
fit <- train(Species~., data=iris, method="rf", metric="logLoss", trControl=control)
# display results
print(fit)
Your argument name is not correct (i.e. "unused argument (model = method)"). The webpage says that the last function argument should be called model and not method.

Resources