I'm trying to follow this link to create a custom SVM and run it through some cross-validations. My primary reason for this is to tune the Sigma, Cost and Epsilon parameters in my grid search; the closest built-in caret model (svmRadial) can tune only two of those.
When I attempt to run the code below, I get the following error at every iteration of my grid:
Warning in eval(expr, envir, enclos) :
model fit failed for Fold1.: sigma=0.2, C=2, epsilon=0.1 Error in if (!isS4(modelFit) & !(method$label %in% c("Ensemble Partial Least Squares Regression", :
argument is of length zero
Even when I replicate the code from the link verbatim I get a similar error, and I'm not sure how to solve it. I found this link, which goes through how the custom models are built, and I can see where this error is referenced, but I'm still not sure what the issue is. My code is below:
#Generate Tuning Criteria across Parameters
library(caret)
library(kernlab)
C <- c(1, 2)
sigma <- c(0.1, 0.2)
epsilon <- c(0.1, 0.2)
#Parameters
prm <- data.frame(parameter = c("C", "sigma", "epsilon"),
                  class = rep("numeric", 3),
                  label = c("Cost", "Sigma", "Epsilon"))
#Tuning Grid (len is ignored; the grid is built from the vectors above)
svmGrid <- function(x, y, len = NULL) {
  expand.grid(sigma = sigma,
              C = C,
              epsilon = epsilon)
}
#Fit Element Function
svmFit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  ksvm(x = as.matrix(x), y = y,
       type = "eps-svr",
       kernel = "rbfdot",
       kpar = list(sigma = param$sigma),
       C = param$C,
       epsilon = param$epsilon,
       prob.model = classProbs,
       ...)
}
#Predict Element Function
svmPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
#Sort Element Function
svmSort <- function(x) x[order(x$C), ]
#Model
newSVM <- list(type = "Regression",
               library = "kernlab",
               loop = NULL,
               parameters = prm,
               grid = svmGrid,
               fit = svmFit,
               predict = svmPred,
               prob = NULL,
               sort = svmSort,
               levels = NULL)
#Train
tc <- trainControl("repeatedcv", number = 2, repeats = 1,
                   verboseIter = TRUE, savePredictions = TRUE)
svmCV <- train(Y ~ X1 + X2,
               data = data_nn,
               method = newSVM,
               trControl = tc,
               preProc = c("center", "scale"))
svmCV
After viewing the second link, I decided to try adding a label element to the model list, and that solved the issue! It's funny that it worked, because the caret documentation says that value is optional, but if it works I can't complain. In hindsight the error message itself points there: with label missing, method$label %in% c(...) evaluates to logical(0), and calling if() on a zero-length condition is exactly what throws "argument is of length zero".
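You can see the mechanics of that error in isolation (a quick illustration, not from the original post):
NULL %in% c("a", "b")   # logical(0): comparing against a missing label
if (logical(0)) "never" # Error in if (logical(0)) ... : argument is of length zero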
#Model
newSVM <- list(label = "My Model",
               type = "Regression",
               library = "kernlab",
               loop = NULL,
               parameters = prm,
               grid = svmGrid,
               fit = svmFit,
               predict = svmPred,
               prob = NULL,
               sort = svmSort,
               levels = NULL)
I have a dataset on which I am doing k-fold cross-validation. In each fold, I split the data into a train and a test dataset. For training on the dataset X, I run the following code:
cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
I check the class of cv_glmnet, and train is returned. I then want to use this model to predict values in the test dataset, which is a matrix with the same number of variables (columns):
# predicting on test data
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])
However, I keep running into the following error:
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") :
The number of variables in newx must be 210
I noticed that the predict.train documentation states the following:
newdata an optional set of data to predict on. If NULL, then the
original training data are used but, if the train model used a recipe,
an error will occur.
I am confused as to why I am running into this error. Is it related to how I am defining newdata? My data has the right number of variables/columns (the same as the train dataset), so I have no idea what is causing the error.
You get the error because your column names change when you pass as.data.frame(X). If your matrix doesn't have column names, the conversion creates them, and the model expects those names when it tries to predict. If it does have column names, some of them could still be changed:
library(caret)
X = matrix(runif(50*20), ncol = 20)
y = rnorm(50)
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) :
The number of variables in newx must be 20
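You can see the renaming directly (a quick check; X is the unnamed matrix from above):
colnames(X)                # NULL: the matrix has no column names
colnames(as.data.frame(X)) # "V1" "V2" ... "V20": names invented by the conversion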
If you have column names, it works:
colnames(X) = paste0("column", 1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
I'm having trouble with my custom training model in the caret package. I need to do SVM regression and I want to tune all the parameters of the SVM model: cost, sigma and epsilon. The built-in version has only cost and sigma. I have already found quite a helpful tip here and here, but my model still does not work.
Error in models$grid(x = x, y = y, len = tuneLength, search = trControl$search) :
unused argument (search = trControl$search)
This is the error I am getting, and my code is here:
SVMrbf <- list(type = "Regression", library = "kernlab", loop = NULL)
prmrbf <- data.frame(parameter = c('sigma', 'C', 'epsilon'),
                     class = c("numeric", "numeric", "numeric"),
                     label = c('Sigma', "Cost", "Epsilon"))
SVMrbf$parameters <- prmrbf
svmGridrbf <- function(x, y, len = NULL) {
  library(kernlab)
  sigmas <- sigest(as.matrix(x), na.action = na.omit, scaled = TRUE, frac = 1)
  expand.grid(sigma = mean(sigmas[-2]), epsilon = 10^(-5:0),
              C = 2^(-5:len)) # len = tuneLength in train
}
SVMrbf$grid <- svmGridrbf
svmFitrbf <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  ksvm(x = as.matrix(x), y = y,
       type = "eps-svr",
       kernel = "rbfdot",
       kpar = list(sigma = param$sigma),
       C = param$C, epsilon = param$epsilon,
       prob.model = classProbs,
       ...)
}
SVMrbf$fit <- svmFitrbf
svmPredrbf <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
SVMrbf$predict <- svmPredrbf
svmProb <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "probabilities")
SVMrbf$prob <- svmProb
svmSortrbf <- function(x) x[order(x$C), ]
SVMrbf$sort <- svmSortrbf
svmRbfFit <- train(x = train.predictors1, y = train.response1,
                   method = SVMrbf, tuneLength = 10)
svmRbfFit
I could not find anyone who had the same error, and I have no clue what is actually wrong. This code is pretty much something I found online and slightly altered.
BTW, this is my first post, so hopefully it's understandable; if not, I can add additional info.
The solution is to include a search argument in your grid function, for example:
svmGridrbf <- function(x, y, len = NULL, search = "grid") {
  library(kernlab)
  sigmas <- sigest(as.matrix(x), na.action = na.omit, scaled = TRUE, frac = 1)
  expand.grid(sigma = mean(sigmas[-2]), epsilon = 10^(-5:0),
              C = 2^(-5:len)) # len = tuneLength in train
}
If you look carefully at the caret documentation for custom models, you'll see that caret wants you to specify how to select default parameters both when the user wants a grid search and when she wants a random search (see "the grid element").
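A grid function that actually honors both modes might look like the sketch below; the random branch and its sampling ranges are my own assumption, not part of the original answer:
svmGridrbf <- function(x, y, len = NULL, search = "grid") {
  library(kernlab)
  sigmas <- sigest(as.matrix(x), na.action = na.omit, scaled = TRUE, frac = 1)
  if (search == "grid") {
    expand.grid(sigma = mean(sigmas[-2]), epsilon = 10^(-5:0), C = 2^(-5:len))
  } else {
    # random search: draw len candidates, sampling epsilon and C on log scales
    data.frame(sigma = runif(len, min = min(sigmas), max = max(sigmas)),
               epsilon = 10^runif(len, min = -5, max = 0),
               C = 2^runif(len, min = -5, max = 10))
  }
}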
The error message tells you that caret passes an argument to the function which is not actually defined as an argument for that function.
This is probably easier to see here:
sd(x = c(1,2,3), a = 2)
# Error in sd(x = c(1, 2, 3), a = 2) : unused argument (a = 2)
Currently caret's svmRadial method uses kernlab's svm functions under the hood, and these are slow for my current purpose, whereas the e1071 svm trainer offers a much-needed speed boost. So I would like to use caret's CV procedure with the svm trainer from e1071; basically, I want caret's default kernlab svm engine to be replaced by e1071. Is there any way to do that?
Currently I train with the following code:
# svm using kernlab
svmModel2 = train(factor(TopPick) ~ . - Date, data = trainSet, method = 'svmRadial')
pred.svm2 = predict(svmModel2, testSet)
# svm using e1071
svmModel = e1071::svm(factor(TopPick) ~ . - Date, data = trainSet)
pred.svm = predict(svmModel, testSet)
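For reference, the raw fit times of the two engines can be compared directly; a quick sketch, assuming the same trainSet as above:
# compare single-fit times of the two backends on identical formula/data
system.time(kernlab::ksvm(factor(TopPick) ~ . - Date, data = trainSet))
system.time(e1071::svm(factor(TopPick) ~ . - Date, data = trainSet))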
Thanks for the help.
As suggested in the comments, you can create your own custom model:
svmRadial2ModelInfo <- list(
  label = "Support Vector Machines with Radial Kernel based on libsvm",
  library = "e1071",
  type = c("Regression", "Classification"),
  parameters = data.frame(parameter = c("cost", "gamma"),
                          class = c("numeric", "numeric"),
                          label = c("Cost", "Gamma")),
  grid = function(x, y, len = NULL, search = NULL) {
    sigmas <- kernlab::sigest(as.matrix(x), na.action = na.omit, scaled = TRUE)
    return( expand.grid(gamma = mean(as.vector(sigmas[-2])),
                        cost = 2^((1:len) - 3)) )
  },
  loop = NULL,
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    if(any(names(list(...)) == "probability") | is.numeric(y))
    {
      out <- svm(x = as.matrix(x), y = y,
                 kernel = "radial",
                 cost = param$cost,
                 gamma = param$gamma,
                 ...)
    } else {
      out <- svm(x = as.matrix(x), y = y,
                 kernel = "radial",
                 cost = param$cost,
                 gamma = param$gamma,
                 probability = classProbs,
                 ...)
    }
    out
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    predict(modelFit, newdata)
  },
  prob = function(modelFit, newdata, submodels = NULL) {
    out <- predict(modelFit, newdata, probability = TRUE)
    attr(out, "probabilities")
  },
  varImp = NULL,
  predictors = function(x, ...){
    out <- if(!is.null(x$terms)) predictors.terms(x$terms) else x$xNames
    if(is.null(out)) out <- names(attr(x, "scaling")$x.scale$`scaled:center`)
    if(is.null(out)) out <- NA
    out
  },
  levels = function(x) x$levels,
  sort = function(x) x[order(x$cost, -x$gamma), ]
)
Usage:
svmR <- caret::train(x = trainingSet$x,
                     y = trainingSet$y,
                     trControl = caret::trainControl(number = 10),
                     method = svmRadial2ModelInfo,
                     tuneLength = 3)
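Since the model info also defines a prob element, a classification call with class probabilities should work as well. A sketch, assuming the factor outcome TopPick from the question:
ctrl <- caret::trainControl(method = "cv", number = 10, classProbs = TRUE)
svmC <- caret::train(factor(TopPick) ~ . - Date, data = trainSet,
                     method = svmRadial2ModelInfo,
                     trControl = ctrl, tuneLength = 3)
head(predict(svmC, testSet, type = "prob")) # per-class probabilities
# note: classProbs = TRUE requires factor levels that are valid R names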
I have a tidy dataset with no missing values and only numeric columns.
The dataset is both large and contains sensitive information, so I won't be able to provide a copy of it here, unfortunately.
I partition this data into training and testing sets with caret's createDataPartition:
idx <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE)
training <- model_final[idx,]
testing <- model_final[-idx,]
x <- training[-ncol(training)]
y <- training$y
x1 <- testing[-ncol(testing)]
y1 <- testing$y
row.names(training) <- NULL
row.names(testing) <- NULL
row.names(x) <- NULL
row.names(y) <- NULL
row.names(x1) <- NULL
row.names(y1) <- NULL
I've been fitting and refitting Random Forest models on this data via randomForest on a regular basis:
rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
                   corr.bias = TRUE, do.trace = TRUE, nPerm = 3)
I decided to see if I could get better or faster results with train; the following model ran fine, but took about 2 hours:
rf_train <- train(y = y, x = x,
                  method = 'rf', tuneLength = 3,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE))
I need to take an HPC approach to make this logistically feasible, so I tried
require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y = y, x = x,
                  method = 'parRF', tuneGrid = data.frame(mtry = 3), na.action = na.omit,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE, allowParallel = TRUE))
but regardless of whether I use tuneLength or tuneGrid, this leads to strange errors about missing values and tuning parameters:
Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
missing values found in aggregated results
I say this is weird both because there were no errors with method = "rf" and because I triple-checked to ensure there are no missing values. I get the same errors when completely omitting the tuning options, and I also tried toggling the na.action option on and off and changing "cv" to "repeatedcv".
I even get the same error with this ultra-simplified version:
rf_train <- train(y=y, x=x, method='parRF')
This seems to be caused by a bug in caret; see the answer to: parRF on caret not working for more than one core.
I just dealt with this same issue, and loading foreach on each new cluster manually seems to work.
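A minimal sketch of that workaround (assuming 8 cores, as in the question):
library(doParallel)
cl <- makePSOCKcluster(8)
clusterEvalQ(cl, library(foreach)) # load foreach on each worker manually
registerDoParallel(cl)
rf_train <- train(y = y, x = x, method = 'parRF')
stopCluster(cl)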
I am trying to write an R function to run a weighted (optional) regression, and I am having difficulty getting the weight variable to work.
Here is a simplified version of the function.
HC <- function(data, FUN, formula, tau = 0.5, weights = NULL){
  if(is.null(weights)){
    est <- FUN(data = data, formula = formula, tau = tau)
    intercept = est$coef[["(Intercept)"]]
    zeroWorker <- exp(intercept)
  }
  else {
    est <- FUN(data = data, formula = formula, tau = tau, weights = weights)
    intercept = est$coef[["(Intercept)"]]
    zeroWorker <- exp(intercept)
  }
  return(zeroWorker)
}
The function works perfectly if I do not use the weights argument.
mod1 <- HC(data = mydata, formula = lin.model, tau = 0.2,
           FUN = rq)
But it throws an error when I use the weights argument.
mod2 <- HC(data = mydata, formula = lin.model, tau = 0.2,
           FUN = rq, weights = weig)
I googled the problem, and this post seems to be the closest to mine, but I could still not get it to work: R: Pass argument to glm inside an R function.
Any help will be appreciated.
My problem can be replicated with:
library("quantreg")
data(engel)
mydata <- engel
mydata$weig <- with(mydata, log(sqrt(income))) # create a fictitious weight variable
lin.model <- foodexp ~ income
mod1 <- HC(data = mydata, formula = lin.model, tau = 0.2,
           FUN = rq) # this works perfectly
mod2 <- HC(data = mydata, formula = lin.model, tau = 0.2,
           FUN = rq, weights = weig) # throws an error
Error in HC(data = mydata, formula = lin.model, tau = 0.2, FUN = rq, weights = weig) :
object 'weig' not found
You have two problems. The first error occurs because you're trying to use the weig variable without referencing it as coming from the mydata dataset; try mydata$weig. That solves the first error, but you then hit the real one, which relates to how the weights argument is evaluated:
Error in model.frame.default(formula = formula, data = data, weights = substitute(weights), :
invalid type (symbol) for variable '(weights)'
The solution is to add the variable specified in HC's weights argument to the data frame before passing it to FUN:
HC <- function(data, FUN, formula, tau = 0.5, weights = NULL){
  data$.weights <- weights
  if(is.null(weights)){
    est <- FUN(data = data, formula = formula, tau = tau)
  } else {
    est <- FUN(data = data, formula = formula, tau = tau, weights = .weights)
  }
  intercept = est$coef[["(Intercept)"]]
  zeroWorker <- exp(intercept)
  return(zeroWorker)
}
Then everything works:
mod2 <- HC(data = mydata, formula = lin.model, tau = 0.2, FUN = rq, weights = mydata$weig)
mod2
# [1] 4.697659e+47
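If you'd rather keep the original call syntax (an unquoted weights = weig), a variant that resolves the bare column name inside the function is sketched below. The substitute()/eval() idiom here is my own suggestion, not part of the answer above:
HC2 <- function(data, FUN, formula, tau = 0.5, weights = NULL){
  # capture the unevaluated weights expression and look it up in `data` first
  w <- eval(substitute(weights), data, parent.frame())
  if(is.null(w)){
    est <- FUN(data = data, formula = formula, tau = tau)
  } else {
    data$.weights <- w
    est <- FUN(data = data, formula = formula, tau = tau, weights = .weights)
  }
  exp(est$coef[["(Intercept)"]])
}
mod2 <- HC2(data = mydata, formula = lin.model, tau = 0.2, FUN = rq, weights = weig)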