Understand grid in caret using R - r

What it means len in this code, this is the way in that caret build the grid, and i dont know how the len is selected.
function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
out <- expand.grid(degree = seq(1, min(len, 3)),
scale = 10 ^((1:len) - 4),
tau = 2 ^((1:len) - 5))

Len is set when you call the train function. The first thing that happens is this:
tuneLength = ifelse(trControl$method == "none", 1, 3)
which in most cases will result in 3. Later on when tuneGrid is called this code is used:
tuneGrid <- models$grid(x = predict(ppObj, x), y = y,
len = tuneLength, search = trControl$search)
As you can see here len is filled with the value from tuneLength. Which is used to fill the len part of grid function of the model.

Related

Avoiding duplication in R

I am trying to fit a variety of (truncated) probability distributions to the same very thin set of quantiles. I can do it but it seems to require lots of duplication of the same code. Is there a neater way?
I am using this code by Nadarajah and Kotz to generate the pdf of the truncated distributions:
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...)
{
tt <- p
G <- get(paste("p", spec, sep = ""), mode = "function")
Gin <- get(paste("q", spec, sep = ""), mode = "function")
tt <- Gin(G(a, ...) + p*(G(b, ...) - G(a, ...)), ...)
return(tt)
}
where spec can be the name of any untruncated distribution for which code in R exists, and the ... argument is used to provide the names of the parameters of that untruncated distribution.
To achieve the best fit I need to measure the distance between the given quantiles and those calculated using arbitrary values of the parameters of the distribution. In the case of the gamma distribution, for example, the code is as follows:
spec <- "gamma"
fit_gamma <- function(x, l = 0, h = 20, t1 = 5, t2 = 13){
ct1 <- qtrunc(p = 1/3, spec, a = l, b = h, shape = x[1],rate = x[2])
ct2 <- qtrunc(p = 2/3, spec, a = l, b = h, shape = x[1],rate = x[2])
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2- ct2)^2
return(sqrt(sum(dist)))
}
where l is the lower truncation, h is the higher and I am given the two tertiles t1 and t2.
Finally, I seek the best fit using optim, thus:
gamma_fit <- optim(par = c(2, 4),
fn = fit_gamma,
l = l,
h = h,
t1 = t1,
t2 = t2,
method = "L-BFGS-B",
lower = c(1.01, 1.4)
Now suppose I want to do the same thing but fitting a normal distribution instead. The names of the parameters of the normal distribution that I am using in R are mean and sd.
I can achieve what I want but only by writing a whole new function fit_normal that is extremely similar to my fit_gamma function but with the new parameter names used in the definition of ct1 and ct2.
The problem of duplication of code becomes very severe because I wish to try fitting a large number of different distributions to my data.
What I want to know is whether there is a way of writing a generic fit_spec as it were so that the parameter names do not have to be written out by me.
Use x as a named list to create a list of arguments to pass into qtrunc() using do.call().
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13){
args <- c(x, list(spec = spec, a = l, b = h))
ct1 <- do.call(qtrunc, args = c(list(p = 1/3), args))
ct2 <- do.call(qtrunc, args = c(list(p = 2/3), args))
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2 - ct2)^2
return(sqrt(sum(dist)))
}
This is called as follows, which is the same as your original function.
fit_distro(list(shape = 2, rate = 3), "gamma")
# [1] 13.07425
fit_gamma(c(2, 3))
# [1] 13.07425
This will work with other distributions, for however many parameters they have.
fit_distro(list(mean = 10, sd = 3), "norm")
# [1] 4.08379
fit_distro(list(shape1 = 2, shape2 = 3, ncp = 10), "beta")
# [1] 12.98371

Defining a new model in crmPack and feeding it new data using "update" function

I've followed the code provided in the vignette of crmPack (pages 16-17) to define the one-parameter power model. I would need to feed it some toxicity data using the update function, but the error I get is "no slot of name "call"". Here is the code below. I would be very grateful for any comments or ideas in order to "fix" this. Many thanks.
# package and options
library(crmPack)
options <- McmcOptions(burnin = 1000, step = 2, samples = 5000)
set.seed(1)
# extra functions to define the power model
.OneParExp <- setClass(Class = "OneParExp", contains = "Model",
representation(skeletonFun = "function",
skeletonProbs = "numeric",
lambda = "numeric"))
OneParExp <- function(skeletonProbs, doseGrid, lambda)
{
skeletonFun <- approxfun(x = doseGrid, y = skeletonProbs, rule = 2)
invSkeletonFun <- approxfun(x = skeletonProbs, y = doseGrid, rule = 1)
.OneParExp(
skeletonFun = skeletonFun, skeletonProbs = skeletonProbs,
lambda = lambda,
datamodel = function(){
for (i in 1:nObs)
{
y[i] ~ dbern(p[i])
p[i] <- skeletonProbs[xLevel[i]]^theta
}},
datanames = c("nObs", "y", "xLevel"),
prob = function(dose, theta){ skeletonFun(dose)^theta },
dose = function(prob, theta){ invSkeletonFun(prob^(1 / theta)) },
priormodel = function(){ theta ~ dexp(lambda) },
modelspecs = function(){ list(skeletonProbs = skeletonProbs,
lambda = lambda) },
init = function(){ list(theta = 1) }, sample = "theta")
}
# tox data and model fitting
data <- Data(x = c(1.2,1.2,1.8,2.4,3),
y = c(0, 0, 0, 1, 1),
cohort = c(1, 1, 2, 3, 4),
doseGrid = seq(1.2, 3, 0.6),
ID = 1:5,
placebo = FALSE)
(skeletonProbs <- round(data#doseGrid / max(data#doseGrid) / 4, 2))
newModel <- OneParExp(skeletonProbs = skeletonProbs,
doseGrid = data#doseGrid, lambda = 1)
newDLTmodel <- update(object=newModel, data=data)
You don't use the "update" function here to feed data to the model. ("update" methods are primarily internal methods to update "Data" objects in crmPack.) Instead, you use "mcmc" to estimate parameters given a model and data:
estimates <- mcmc(model=newModel, data=data, options=McmcOptions())
plot(estimates, newModel, data)

Custom classification threshold for GBM

I'm trying to create a custom GBM model that tunes the classification threshold for a binary classification problem. There is a nice example provided on the caret website here, but when I try to apply something similar to GBM I receive the following error:
Error in { : task 1 failed - "argument 1 is not a vector"
Unfortunately, I have no idea where the error is and the error isn't very helpful.
Here's an example, with the code that I've used for defining the custom GBM
library(caret)
library(gbm)
library(pROC)
#### DEFINE A CUSTOM GBM MODEL FOR PROBABILITY THRESHOLD TUNING ####
## Get the model code for the original gbm method from caret
customGBM <- getModelInfo("gbm", regex = FALSE)[[1]]
customGBM$type <- c("Classification")
## Add the threshold (i.e. class cutoff) as another tuning parameter
customGBM$parameters <- data.frame(parameter = c("n.trees", "interaction.depth", "shrinkage",
"n.minobsinnode", "threshold"),
class = rep("numeric", 5),
label = c("# Boosting Iterations", "Max Tree Depth", "Shrinkage",
"Min. Terminal Node Size", "Probability Cutoff"))
## Customise the tuning grid:
## Some paramters are fixed. Will give a tuning grid of 2,500 values if len = 100
customGBM$grid <- function(x, y, len = NULL, search = "grid") {
if (search == "grid") {
grid <- expand.grid(n.trees = seq(50, 250, 50),
interaction.depth = 2, ### fix interaction depth at 2
shrinkage = 0.0001, ### fix learning rate at 0.0001
n.minobsinnode = seq(2, 10, 2),
threshold = seq(.01, .99, length = len))
} else {
grid <- expand.grid(n.trees = floor(runif(len, min = 1, max = 5000)),
interaction.depth = sample(1:10, replace = TRUE, size = len),
shrinkage = runif(len, min = .001, max = .6),
n.minobsinnode = sample(5:25, replace = TRUE, size = len),
threshold = runif(1, 0, size = len))
grid <- grid[!duplicated(grid),] ### remove any duplicated rows in the training grid
}
grid
}
## Here we fit a single gbm model and loop over the threshold values to get predictions from the
## same gbm model.
customGBM$loop = function(grid) {
library(plyr)
loop <- ddply(grid, c("n.trees", "shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(threshold = max(x$threshold)))
submodels <- vector(mode = "list", length = nrow(loop))
for (i in seq(along = loop$threshold)) {
index <- which(grid$n.trees == loop$n.trees[i] &
grid$interaction.depth == loop$interaction.depth[i] &
grid$shrinkage == loop$shrinkage[i] &
grid$n.minobsinnode == loop$n.minobsinnode[i])
cuts <- grid[index, "threshold"]
submodels[[i]] <- data.frame(threshold = cuts[cuts != loop$threshold[i]])
}
list(loop = loop, submodels = submodels)
}
## Fit the model independent of the threshold parameter
customGBM$fit = function(x, y, wts, param, lev, last, classProbs, ...) {
theDots <- list(...)
if (any(names(theDots) == "distribution")) {
modDist <- theDots$distribution
theDots$distribution <- NULL
} else {
if (is.numeric(y)) {
stop("This works only for 2-class classification problems")
} else modDist <- if (length(lev) == 2) "bernoulli" else
stop("This works only for 2-class classification problems")
}
# if (length(levels(y)) != 2)
# stop("This works only for 2-class problems")
## check to see if weights were passed in (and availible)
if (!is.null(wts)) theDots$w <- wts
if (is.factor(y) && length(lev) == 2) y <- ifelse(y == lev[1], 1, 0)
modArgs <- list(x = x,
y = y,
interaction.depth = param$interaction.depth,
n.trees = param$n.trees,
shrinkage = param$shrinkage,
n.minobsinnode = param$n.minobsinnode,
distribution = modDist)
do.call("gbm.fit", modArgs)
}
## Now get a probability prediction and use different thresholds to
## get the predicted class
customGBM$predict = function(modelFit, newdata, submodels = NULL) {
out <- predict(modelFit, newdata, n.trees = modelFit$tuneValue$n.trees,
type = "response")#[, modelFit$obsLevels[1]]
out[is.nan(out)] <- NA
class1Prob <- ifelse(out >= modelFit$tuneValue$threshold,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
## Raise the threshold for class #1 and a higher level of
## evidence is needed to call it class 1 so it should
## decrease sensitivity and increase specificity
out <- ifelse(class1Prob >= modelFit$tuneValue$threshold,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
if (!is.null(submodels)) {
tmp2 <- out
out <- vector(mode = "list", length = length(submodels$threshold))
out[[1]] <- tmp2
for (i in seq(along = submodels$threshold)) {
out[[i + 1]] <- ifelse(class1Prob >= submodels$threshold[[i]],
modelFit$obsLevels[1],
modelFit$obsLevels[2])
}
}
out
}
## The probabilities are always the same but we have to create
## mulitple versions of the probs to evaluate the data across
## thresholds
customGBM$prob = function(modelFit, newdata, submodels = NULL) {
out <- predict(modelFit, newdata, type = "response",
n.trees = modelFit$tuneValue$n.trees)
out[is.nan(out)] <- NA
out <- cbind(out, 1 - out)
colnames(out) <- modelFit$obsLevels
if (!is.null(submodels)) {
tmp <- predict(modelFit, newdata, type = "response", n.trees = submodels$n.trees)
tmp <- as.list(as.data.frame(tmp))
lapply(tmp, function(x, lvl) {
x <- cbind(x, 1 - x)
colnames(x) <- lvl
x}, lvl = modelFit$obsLevels)
out <- c(list(out), tmp)
}
out
}
fourStats <- function (data, lev = levels(data$obs), model = NULL) {
## This code will get use the area under the ROC curve and the
## sensitivity and specificity values using the current candidate
## value of the probability threshold.
out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
## The best possible model has sensitivity of 1 and specificity of 1.
## How far are we from that value?
coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
ncol = 2,
byrow = TRUE)
colnames(coords) <- c("Spec", "Sens")
rownames(coords) <- c("Best", "Current")
c(out, Dist = dist(coords)[1])
}
And then some code showing how to use the custom model
set.seed(949)
trainingSet <- twoClassSim(500, -9)
mod1 <- train(Class ~ ., data = trainingSet,
method = customGBM, metric = "Dist",
maximize = FALSE, tuneLength = 10,
trControl = trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = fourStats))
The model appears to run, but finishes with the error from above. If someone could please help me with customising the GBM model to tune the GBM parameters, and the probability threshold for the classes that would be great.

Error with tuning custom SVM model in caret

I'm having trouble with my custom training model in the caret package. I need to do a SVM regression and I want to find all the parameters of the SVM model - cost, sigma and epsilon. The built-in version has only cost and sigma. I have already found quite a helpful tip here and here but my model still does not work.
Error in models$grid(x = x, y = y, len = tuneLength, search = trControl$search) :
unused argument (search = trControl$search)
This error is the one I am getting and my code is here.
SVMrbf <- list(type = "Regression", library = "kernlab", loop = NULL)
prmrbf <- data.frame(parameters = data.frame(parameter = c('sigma', 'C', 'epsilon'),
class = c("numeric", "numeric", "numeric"),
label = c('Sigma', "Cost", "epsilon")))
SVMrbf$parameters <- prmrbf
svmGridrbf <- function(x, y, len = NULL) {
library(kernlab)
sigmas <- sigest(as.matrix(x), na.action = na.omit, scaled = TRUE, frac = 1)
expand.grid(sigma = mean(sigmas[-2]), epsilon = 10^(-5:0),
C = 2 ^(-5:len)) # len = tuneLength in train
}
SVMrbf$grid <- svmGridrbf
svmFitrbf <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
ksvm(x = as.matrix(x), y = y,
type = "eps-svr",
kernel = "rbfdot",
sigma = param$sigma,
C = param$C, epsilon = param$epsilon,
prob.model = classProbs,
...)
}
SVMrbf$fit <- svmFitrbf
svmPredrbf <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata)
SVMrbf$predict <- svmPredrbf
svmProb <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata, type="probabilities")
SVMrbf$prob <- svmProb
svmSortrbf <- function(x) x[order(x$C), ]
SVMrbf$sort <- svmSortrbf
svmRbfFit <- train(x = train.predictors1, y = train.response1, method = SVMrbf,
tuneLength = 10)
svmRbfFit
I could not find anyone, who had the same error and have no clue what is actually wrong. This code is pretty much just something I found online and slightly altered.
BTW this is my first post, so hopefully it's understandable, if not I can add additional info.
The solution is to include an argument search into your grid function, for example with
svmGridrbf <- function(x, y, len = NULL, search = "grid") {
library(kernlab)
sigmas <- sigest(as.matrix(x), na.action = na.omit, scaled = TRUE, frac = 1)
expand.grid(sigma = mean(sigmas[-2]), epsilon = 10^(-5:0), C = 2 ^(-5:len)) # len = tuneLength in train
}
If you look at the caret documentation for custom functions carefully, you'll see that caret wants you to specify how to select default parameters in case the user wants to do grid search and in case she wants to do random search (see "the grid element").
The error message tells you that caret passes an argument to the function which is not actually defined as an argument for that function.
This is probably easier to see here:
sd(x = c(1,2,3), a = 2)
# Error in sd(x = c(1, 2, 3), a = 2) : unused argument (a = 2)

Error with custom SVM model for tuning in caret

I'm trying to follow this link to create a custom SVM and run it through some cross-validations. My primary reason for this is to run Sigma, Cost and Epsilon parameters in my grid-search and the closest caret model (svmRadial) can only do two of those.
When I attempt to run the code below, I get the following error all over the place at every iteration of my grid:
Warning in eval(expr, envir, enclos) :
model fit failed for Fold1.: sigma=0.2, C=2, epsilon=0.1 Error in if (!isS4(modelFit) & !(method$label %in% c("Ensemble Partial Least Squares Regression", :
argument is of length zero
Even when I replicate the code from the link verbatim, I get a similar error and I'm not sure how to solve it. I found this link which goes through how the custom models are built and I see where this error is referenced, but still not sure what the issue is. I have my code below:
#Generate Tuning Criteria across Parameters
C <- c(1,2)
sigma <- c(0.1,.2)
epsilon <- c(0.1,.2)
grid <- data.frame(C,sigma)
#Parameters
prm <- data.frame(parameter = c("C", "sigma","epsilon"),
class = rep("numeric", 3),
label = c("Cost", "Sigma", "Epsilon"))
#Tuning Grid
svmGrid <- function(x, y, len = NULL) {
expand.grid(sigma = sigma,
C = C,
epsilon = epsilon)
}
#Fit Element Function
svmFit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
ksvm(x = as.matrix(x), y = y,
type = "eps-svr",
kernel = rbfdot,
kpar = list(sigma = param$sigma),
C = param$C,
epsilon = param$epsilon,
prob.model = classProbs,
...)
}
#Predict Element Function
svmPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata)
#Sort Element Function
svmSort <- function(x) x[order(x$C),]
#Model
newSVM <- list(type="Regression",
library="kernlab",
loop = NULL,
parameters = prm,
grid = svmGrid,
fit = svmFit,
predict = svmPred,
prob = NULL,
sort = svmSort,
levels = NULL)
#Train
tc<-trainControl("repeatedcv",number=2, repeats = 0,
verboseIter = T,savePredictions=T)
svmCV <- train(
Y~ 1
+ X1
+ X2
,data = data_nn,
method=newSVM,
trControl=tc
,preProc = c("center","scale"))
svmCV
After viewing the second link provided, I decided to try and include a label into the Model's parameters and that solved the issue! It's funny that it worked because the caret documentation says that value is optional, but if it works I can't complain.
#Model
newSVM <- list(label="My Model",
type="Regression",
library="kernlab",
loop = NULL,
parameters = prm,
grid = svmGrid,
fit = svmFit,
predict = svmPred,
prob = NULL,
sort = svmSort,
levels = NULL)

Resources