How to create model object in bagging? - r

I tried creating a function for Ensemble of Ensemble modelling:
library(foreach)
library(randomForest)
set.seed(10)
Y<-round(runif(1000))
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-c(1:1000)*runif(1000,min=0,max=2)
x3<-c(1:1000)*runif(1000,min=0,max=2)
all_data<-data.frame(Y,x1,x2,x3)
bagging = function(dataFile, length_divisor = 4, iterations = 100)
{
fit = list()
predictions = foreach(m = 1 : iterations, .combine = cbind) %do%
{
dataFile$Y = as.factor(dataFile$Y)
rf_fit = randomForest(Y ~ ., data = dataFile, ntree = 100)
fit[[m]] = rf_fit
rf_fit$votes[,2]
}
rowMeans(predictions)
return(list(formula = as.formula("Y ~ ."), trees = fit, ntree = 100, class = dataFile$Y, votes = predictions))
}
final_model = bagging(all_data)
predict(final_model, TestData) # It says predict doesn't support final_model object
# Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list"
It says -
Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list".
I need the above function bagging to return an aggregated model object so that I can predict on new data set.

Your bagging function just returns an arbitrary list. Predict looks to the class of the first parameter to know "the right thing" to do. I assume you want to predict from the randomForest objects stored inside the list? You can loop over your list with Map(). For example
Map(function(x) predict(x, TestData), final_model$trees)
(untested since you didn't seem to provide TestData)

Related

Create a multivariate matrix in tidymodels recipes::recipe()

I am trying to do a k-fold cross validation on a model that predicts the joint distribution of the proportion of tree species basal area from satellite imagery. This requires the use of the DiricihletReg::DirichReg() function, which in turn requires that the response variables be prepared as a matrix using the DirichletReg::DR_data() function. I originally tried to accomplish this in the caret:: package, but I found out that caret:: does not support multivariate responses. I have since tried to implement this in the tidymodels:: suite of packages. Following the documentation on how to register a new model in the parsnip:: (I appreciate Max Kuhn's vegetable humor) package, I created a "DREG" model and a "DR" engine. My registered model works when I simply call it on a single training dataset, but my goal is to do kfolds cross-validation, implementing the vfolds_cv(), a workflow(), and the 'fit_resample()' function. With the code I currently have I get warning message stating:
Warning message:
All models failed. See the `.notes` column.
Those notes state that Error in get(resp_char, environment(oformula)): object 'cbind(PSME, TSHE, ALRU2)' not found This, I believe is due to the use of DR_data() to preprocess the response variables into the format necessary for Dirichlet::DirichReg() to run properly. I think the solution I need to implement involve getting this pre-processing to happen in either the recipe() call or in the set_fit() call when I register this model with parsnip::. I have tried to use the step_mutate() function when specifying the recipe, but that performs a function on each column as opposed to applying the function with the columns as inputs. This leads to the following error in the "notes" from the output of fit_resample():
Must subset columns with a valid subscript vector.
Subscript has the wrong type `quosures`.
It must be numeric or character.
Is there a way to get the recipe to either transform several columns to a DirichletRegData class using the DR_data() function with a step_*() function or using the pre= argument in set_fit() and set_pred()?
Below is my reproducible example:
##Loading Necessary Packages##
library(tidymodels)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(100,50, 15)
TSHE_BA<-rnorm(100,40,12)
ALRU2_BA<-rnorm(100,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(100, 0, 2000)
B2<-runif(100, 0, 1800)
B3<-runif(100, 0, 3000)
#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA,
B1=B1, B2=B2, B3=B3)
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
#Registering the model to parsnip::#
set_new_model("DREG")
set_model_mode(model="DREG", mode="regression")
set_model_engine("DREG", mode="regression", eng="DR")
set_dependency("DREG", eng="DR", pkg="DirichletReg")
set_model_arg(
model = "DREG",
eng = "DR",
parsnip = "param",
original = "model",
func = list(pkg = "DirichletReg", fun = "DirichReg"),
has_submodel = FALSE
)
DREG <-
function(mode = "regression", param = NULL) {
# Check for correct mode
if (mode != "regression") {
rlang::abort("`mode` should be 'regression'")
}
# Capture the arguments in quosures
args <- list(sub_classes = rlang::enquo(param))
# Save some empty slots for future parts of the specification
new_model_spec(
"DREG",
args=args,
eng_args = NULL,
mode = mode,
method = NULL,
engine = NULL
)
}
set_fit(
model = "DREG",
eng = "DR",
mode = "regression",
value = list(
interface = "formula",
protect = NULL,
func = c(pkg = "DirichletReg", fun = "DirichReg"),
defaults = list()
)
)
set_encoding(
model = "DREG",
eng = "DR",
mode = "regression",
options = list(
predictor_indicators = "none",
compute_intercept = TRUE,
remove_intercept = TRUE,
allow_sparse_x = FALSE
)
)
set_pred(
model = "DREG",
eng = "DR",
mode = "regression",
type = "numeric",
value = list(
pre = NULL,
post = NULL,
func = c(fun = "predict.DirichletRegModel"),
args =
list(
object = expr(object$fit),
newdata = expr(new_data),
type = "response"
)
)
)
##Running the Model##
DF$Y<-DR_data(DF[,c(1:3)]) #Preparing the response variables
dreg_spec<-DREG(param="alternative") %>%
set_engine("DR")
dreg_mod<-dreg_spec %>%
fit(Y~B1+B2+B3, data = DF)#Model works when simply run on single dataset
##Attempting Crossvalidation##
#First attempt - simply call Y as the response variable in the recipe#
kfolds<-vfold_cv(DF, v=10, repeats = 2)
rcp<-recipe(Y~B1+B2+B3, data=DF)
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
#second attempt - use step_mutate_at()#
rcp<-recipe(~B1+B2+B3, data=DF) %>%
step_mutate_at(fn=DR_data, var=vars(PSME, TSHE, ALRU2))
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
This works, but I'm not sure if it's what you were expecting.
First--getting the data setup for CV and DR_data()
I don't know of any package that has built what would essentially be a translation for CV and DirichletReg. Therefore, that part is manually done. You might be surprised to find it's not all that complicated.
Using the data you created and the modeling objects you created for tidymodels (those prefixed with set_), I created the CV structure that you were trying to use.
df1 <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA,
ALRU2=ALRU2_BA/Total_BA, B1, B2, B3)
set.seed(88)
kDf2 <- kDf1 <- vfold_cv(df1, v=10, repeats = 2)
For each of the 20 subset data frames identified in kDf2, I used DR_data to set the data up for the models.
# convert to DR_data (each folds and repeats)
df2 <- map(1:20,
.f = function(x){
in_ids = kDf1$splits[[x]]$in_id
dd <- kDf1$splits[[x]]$data[in_ids, ] # filter rows BEFORE DR_data
dd$Y <- DR_data(dd[, 1:3])
kDf1$splits[[x]]$data <<- dd
})
Because I'm not all that familiar with tidymodels, next conducted the modeling using DirichReg. I then did it again with tidymodels and compared them. (The output is identical.)
DirichReg Models and summaries of the fits
set.seed(88)
# perform crossfold validation on Dirichlet Model
df2.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = DirichReg(Y ~ B1 + B2, daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
# summary of each fitted model
fit.a <- map(1:20,
.f = function(x){
summary(df2.fit[[x]]$fit)
})
tidymodels and summaries of the fits (the code looks the same, but there are a few differences--the output is the same, though)
# I'm not sure what 'alternative' is supposed to do here?
dreg_spec <- DREG(param="alternative") %>% # this is not model = alternative
set_engine("DR")
set.seed(88)
dfa.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = dreg_spec %>%
fit(Y ~ B1 + B2, data = daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
afit.a <- map(1:20,
.f = function(x){
summary(dfa.fit[[x]]$fit$fit) # extra nest for parsnip
})
If you wanted to see the first model?
fit.a[[1]]
afit.a[[1]]
If you wanted the model with the lowest AIC?
# comare AIC, BIC, and liklihood?
# what do you percieve best fit with?
fmin = min(unlist(map(1:20, ~fit.a[[.x]]$aic))) # dir
# find min AIC model number
paste0((map(1:20, ~ifelse(fit.a[[.x]]$aic == fmin, .x, ""))), collapse = "")
fit.a[[19]]
afit.a[[19]]

How to conduct catboost grid search using GPU in R?

I'm setting up a grid search using the catboost package in R. Following the catboost documentation (https://catboost.ai/docs/), the grid search for hyperparameter tuning can be conducted using the 3 separate commands in R,
fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
grid <- expand.grid(depth = c(7,8,9,10), learning_rate = c(0.1,0.2,0.3,0.4), iterations = c(10,100,1000))
report <- train(df.scale, as.factor(make.names(as.matrix(tier1))), method = catboost.caret, logging_level = 'Verbose', preProc = NULL, tuneGrid = grid, trControl = fit_control)
searching across different values for depth, learning rate, and the number of iterations. These commands seem well enough, it's just I can't figure out where to input the option for the task_type = "GPU". Would appreciate any help on how to specify using the GPU for finding the optimal parameters using R.
It can be done the following way:
fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
grid <- expand.grid(depth = c(7,8,9,10), learning_rate = c(0.1,0.2,0.3,0.4), iterations = c(10,100,1000))
report <- train(df.scale, as.factor(make.names(as.matrix(tier1))), method = catboost.caret, logging_level = 'Verbose', preProc = NULL, tuneGrid = grid, trControl = fit_control,
task_type = "GPU")
This works due to ellipsis mechanics. All arguments that are unknown to caret.train itself are eventually passed to catboost.caret$fit and taken as training parameters for catboost. The exact place in catboost code where it happens is here:
...
catboost.caret$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
param <- c(param, list(...)) # all ellipsis args are taken to param
if (is.null(param$loss_function)) {
...
If you pass an unknown parameter this way, catboost will trigger an error:
report <- train(x, as.factor(make.names(y)),
method = catboost.caret,
logging_level = 'Verbose', preProc = NULL,
tuneGrid = grid, trControl = fit_control, what_is_this = "GPU")
> warnings()
Warning messages:
1: model fit failed for Fold1: depth=4, learning_rate=0.1, l2_leaf_reg=0.001, rsm=1, border_count=64, iterations=100 Error in catboost.train(pool, test_pool, param) :
catboost/private/libs/options/plain_options_helper.cpp:501: Unknown option {what_is_this} with value "GPU"
It looks like you are using the caret package to perform the training. In this case, it looks like the caret package does not pass any additional arguments to the catboost.train function so it may not support the GPU functionality. You can see from the code in caret for this method that the ... argument is not passed to the catboost.train function.
#' Fit model based on input data
#'
#' #param x, y: the current data used to fit the model
#' #param wts: optional instance weights (not applicable for this particular model)
#' #param param: the current tuning parameter values
#' #param lev: the class levels of the outcome (or NULL in regression)
#' #param last: a logical for whether the current fit is the final fit
#' #param weights: weights
#' #param classProbs: a logical for whether class probabilities should be computed
#'
#' #noRd
catboost.caret$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
param <- c(param, list(...))
if (is.null(param$loss_function)) {
param$loss_function <- "RMSE"
if (is.factor(y)) {
param$loss_function <- "Logloss"
if (length(lev) > 2) {
param$loss_function <- "MultiClass"
}
y <- as.double(y) - 1
}
}
test_pool <- NULL
if (!is.null(param$test_pool)) {
test_pool <- param$test_pool
if (class(test_pool) != "catboost.Pool")
stop("Expected catboost.Pool, got: ", class(test_pool))
param <- within(param, rm(test_pool))
}
pool <- catboost.from_data_frame(x, y, weight = wts)
model <- catboost.train(pool, test_pool, param)
model$lev <- lev
return(model)
}
Depending on your level of proficiency in R and caret, you can add your own model to caret by basically copying the model in the caret github location and modify it to accept the GPU argument which should go into the parameter list per their documentation

Retrieving Variable Importance from Caret trained model with "lda2", "qda", "lda"

I can get variable importance out from "nnet" and "knn" models, but not from "lda", "lda2", and "qda".
I am using varImp(). I've tried everything I can think of and just can't get a proper idea of what the variable importance is.
Here is my code for training the model:
lda_model <- train(quality2 ~ .,
data = train_data,
method = "lda",
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 2),
importance = TRUE)
and here is the error I get when I try to check importance:
> varImp(lda_model)
Error in model.frame.default(formula = y ~ x, na.action = na.omit, drop.unused.levels = TRUE) :
invalid type (list) for variable 'y'
In addition: Warning messages:
1: In mean.default(y, rm.na = TRUE) :
argument is not numeric or logical: returning NA
2: In Ops.factor(left, right) : ‘-’ not meaningful for factors
I know this means it's treating it as an object class list instead of a trained model, and I've tried it on lda_model$finalmodel and others, but it's still not working.
How can I get proper feedback when using lda/qda on how my model is performing and which variables are performing best?
I had the same problem and it seems to come from the way of the dataset is imported in R. I first imported with the {readxl} package and varImp() didn't work. Then I tried to import throught the clipboard and now varImp is working on my lda model build with {caret}.
My code with {readxl} :
library(readxl)
glauc <- read_excel("Glaucome.xlsx", sheet="GlaucomaM")
rownames(glauc) <- glauc$IDENT
glauc$IDENT <- NULL
glauc$Class <- as.factor(glauc$Class)
library(caret)
numappr <- createDataPartition(glauc$Class, p=0.7)
appr <- glauc[numappr$Resample1,]
test <- glauc[-numappr$Resample1,]
Ctrl <- trainControl(summaryFunction=twoClassSummary,
classProbs=TRUE)
appr.lda <- train(Class~., data=appr, method="lda",
trControl=Ctrl, preProc = c("center","scale"),
metric="ROC")
varImp(appr.lda)
This leads to the same error message as yours.
Error: $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In mean.default(y, rm.na = TRUE) :
argument is not numeric or logical: returning NA
2: In Ops.factor(left, right) : ‘-’ not meaningful for factors
And my code with read.table() and the clipboard :
glauc <- read.table("clipboard", header=T, sep="\t", dec=".")
rownames(glauc) <- glauc$IDENT
glauc$IDENT <- NULL
library(caret)
numappr <- createDataPartition(glauc$Class, p=0.7)
appr <- glauc[numappr$Resample1,]
test <- glauc[-numappr$Resample1,]
Ctrl <- trainControl(summaryFunction=twoClassSummary,
classProbs=TRUE)
appr.lda <- train(Class~., data=appr, method="lda",
trControl=Ctrl, preProc = c("center","scale"),
metric="ROC")
varImp(appr.lda)
This one leads to the result (only the first ones here):
varImp(appr.lda)
ROC curve variable importance
only 20 most important variables shown (out of 62)
Importance
vari 100.00
varg 97.14
vars 94.52
phci 93.69
hic 92.02
phcg 90.55
tms 89.96
Hope it helps.
Sophie

kernlab class prediction calculations failed => with SVMLinear NOT SVMRadial

I got a problem training SVMLinear with caret. The data works just fine with SVMRadial though.
The data is accessible via (29/05/2016):
https://www.dropbox.com/s/ia2vc25uhxdgqn1/projetTest01.txt?dl=0
(8000 lines of 1021 variables, ~10% target)
Here's the code:
projetTest01<-read.table("projetTest01.txt", sep="\t")
Test01<-list(data=projetTest01[,-c(2,3)],label=projetTest01[,3])
Test01N<-Test01
Test01N$label<-as.factor(Test01$label)
levels(Test01N$label)[levels(Test01N$label)=="0"] <- "No"
levels(Test01N$label)[levels(Test01N$label)=="1"] <- "Yes"
temp<-as.matrix(Test01$data)
storage.mode(temp) <- "numeric" #I need 'num' type
Test01N$data<-as.data.frame(temp)
svmTuneGrid_L <- data.frame(.C = 2^(-2:7))
trControl_SVML<-trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE)
svmFit_Lin <- train(Test01N$label ~ ., data = Test01N$data,method = "svmLinear",preProc = c("center", "scale"), tuneGrid = svmTuneGrid_L,trControl = trControl_SVML)
And I got these messages:
line search fails [..]
Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels = param) :
kernlab class prediction calculations failed; returning NAs
Warning in data.frame(..., check.names = FALSE) :
row names were found from a short variable and have been discarded
I looked up the site/the web for some answers, but
the levels aren't numeric (=yes/no)
the ClassProb is set to TRUE
the labels can't be predicted perfectly from another variable (I know this from other algorithms)
there isn't a empty class
preproc(scale) or not doesn't make a difference
And the data works just fine with SVMRadial!!
I use caret 6.0-68
I really am at a loss. An idea someone?

How to incorporate logLoss in caret

I'm attempting to incorporate logLoss as the performance measure used when tuning randomForest (other classifiers) by way of caret (instead of the default options of Accuracy or Kappa).
The first R script executes without error using defaults. However, I get:
Error in { : task 1 failed - "unused argument (model = method)"
when using the second script.
The function logLoss(predict(rfModel,test[,-c(1,95)],type="prob"),test[,95]) works by way of leveraging a separately trained randomForest model.
The dataframe has 100+ columns and 10,000+ rows. All elements are numeric outside of the 9-level categorical "target" at col=95. A row id is located in col=1.
Unfortunately, I'm not correctly grasping the guidance provided by http://topepo.github.io/caret/training.html, nor having much luck via google searches.
Your help are greatly appreciated.
Working R script:
fitControl = trainControl(method = "repeatedcv",number = 10,repeats = 10)
rfGrid = expand.grid(mtry=c(1,9))
rfFit = train(target ~ ., data = train[,-1],method = "rf",trControl = fitControl,verbose = FALSE,tuneGrid = rfGrid)
Not working R script:
logLoss = function(data,lev=NULL,method=NULL) {
lLoss = 0
epp = 10^-15
for (i in 1:nrow(data)) {
index = as.numeric(lev[i])
p = max(min(data[i,index],1-epp),epp)
lLoss = lLoss - log(p)
}
lLoss = lLoss/nrow(data)
names(lLoss) = c("logLoss")
lLoss
}
fitControl = trainControl(method = "repeatedcv",number = 10,repeats = 10,summaryFunction = logLoss)
rfGrid = expand.grid(mtry=c(1,9))
rfFit = train(target ~ ., data = trainBal[,-1],method = "rf",trControl = fitControl,verbose = FALSE,tuneGrid = rfGrid)
I think you should set summaryFunction=mnLogLoss in trainControl and metric="logLoss" in train (I found it here). Like this:
# load libraries
library(caret)
# load the dataset
data(iris)
# prepare resampling method
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(7)
fit <- train(Species~., data=iris, method="rf", metric="logLoss", trControl=control)
# display results
print(fit)
Your argument name is not correct (i.e. "unused argument (model = method)"). The webpage says that the last function argument should be called model and not method.

Resources